Statistical Evaluation of Image Quality Measures *
İsmail Avcıbaş (1), Bülent Sankur (2), Khalid Sayood (3)
(1) Department of Electronic Engineering, Uludağ University, Bursa, Turkey
(2) Department of Electrical and Electronic Engineering, Boğaziçi University, İstanbul, Turkey
(3) Department of Electrical Engineering, University of Nebraska at Lincoln, NE, USA
Corresponding author E-mail: [email protected]
ABSTRACT
In this work we comprehensively categorize image quality measures, extend
measures defined for gray scale images to their multispectral case, and propose
novel image quality measures. They are categorized into pixel difference-based,
correlation-based, edge-based, spectral-based, context-based and HVS-based
(Human Visual System-based) measures. Furthermore we compare these
measures statistically for still image compression applications. The statistical
behavior of the measures and their sensitivity to coding artifacts are investigated
via Analysis of Variance techniques. Their similarities or differences have been
illustrated by plotting their Kohonen maps. Measures that give consistent scores
across an image class and that are sensitive to coding artifacts are pointed out. It
has been found that measures based on phase spectrum, on multiresolution
distance or HVS filtered mean square error are computationally simple and are
more responsive to coding artifacts.
Keywords: Image quality measures; ANOVA analysis; Self-Organizing Map.
1. Introduction
Image quality measures (IQM) are figures of merit used for the evaluation of imaging
systems or of coding/processing techniques. In this study we consider several image quality
metrics and study their statistical behavior when measuring various compression and/or
sensor artifacts.
A good objective quality measure should faithfully reflect the distortion on the image due to, for
example, blurring, noise, compression, or sensor inadequacy. One expects that such measures
could be instrumental in predicting the performance of vision-based algorithms in tasks such as
feature extraction, image-based measurements, detection, tracking, and segmentation.
Our approach is different from companion studies in the literature focused on
subjective image quality criteria, such as in [1], [2], [3]. In the subjective assessment of
measures, characteristics of human perception become paramount, and image quality is
correlated with the preference of an observer or the performance of an operator on some
specific task.
* The work was sponsored by NSF-INT 9996097 and The Scientific Council of Turkey TUBITAK, BDP
Program.
In the image coding and computer vision literature, the most frequently used measures are
deviations between the original and the coded images ([4], [5], [6]), with Mean Square Error
(MSE) or alternatively Signal to Noise Ratio (SNR) varieties being the most common
measures. The reasons for their widespread popularity are their mathematical tractability and
the fact that it is often straightforward to design systems that minimize the MSE. Raw error
measures such as MSE work best when the distortion is due to additive noise contamination.
However they do not necessarily correspond to all aspects of the observer’s visual perception
of the errors [7], [8], nor do they correctly reflect structural coding artifacts.
For multimedia applications and very low bit rate coding, there has been an increase in the
use of quality measures based on human perception [9], [10], [11], [12], [13], [14]. Since a
human observer is the end user in multimedia applications, an image quality measure that is
based on a human vision model seems to be more appropriate for predicting user acceptance
and for system optimization. This class of distortion measures in general gives a numerical
value that will quantify the dissatisfaction of the viewer in observing the reproduced image in
place of the original (though Daly’s VPD map [13] is a counter example to this). The
alternative is the use of subjective tests where subjects view a series of reproduced images
and rate them based on the visibility of artifacts [15], [16]. Subjective tests are tedious, time
consuming and expensive, and the results depend on various factors such as observer’s
background, motivation, etc., and furthermore actually only the display quality is being
assessed. Therefore an objective measure that accurately predicts the subjective rating would
be a useful guide when optimizing image compression algorithms.
Recently there have been ITU (International Telecommunications Union) efforts to establish
objective measurement of video quality. Thus in the context of distribution of multimedia
documents, video programming in particular, in-service continuous evaluation of video
quality is needed. This continuous video quality indicator would be an input to the network
management, which must guarantee a negotiated level of service quality. Obviously such
quality monitoring can only be realized with objective methods [17, 18]. It must be pointed
out, however, that subjective assessment, albeit costly and time-consuming, if not
impractical, is accurate. Objective methods, on the other hand, can at best try to emulate the
performance of subjective methods, utilizing the knowledge of the human visual system.
Similarly for computer vision tasks, prediction of algorithmic performance in terms of
imaging distortions is of great significance [19, 20]. In the literature the performance of
feature extraction algorithms, like lines and corners [19], propagation of covariance matrices
[20], quantification of target detection performance and ideal observer performance [21],
[22], [23] have been studied under additive noise conditions. It is of great interest to correlate
coding and sensor artifacts with such algorithmic performance. More specifically one would
like to identify image quality metrics that can accurately and consistently predict the
performance of computer vision algorithms operating on distorted image records, the
distortions being due to compression, sensor inadequacy etc. An alternative use of image
quality metrics is in the inverse mapping from metrics to the nature of distortions [24]. In
other words given the image quality metrics, one tries to reconstruct the distortions (e.g., blur,
noise, etc. amount in a distortion space) that could have resulted in the measured metric
values.
In this paper we study objective measures of image quality and investigate their statistical
performance. Their statistical behavior is evaluated first, in terms of how discriminating they
are to distortion artifacts when tested on a variety of images using Analysis of Variance
method. The measures are then investigated in terms of their mutual correlation or similarity,
this being put into evidence by means of Kohonen maps.
Twenty-six image quality metrics are listed and described in Appendix A and summarized in
Table 1. These quality metrics are categorized into six groups according to the type of
information they are using. The categories used are:
1. Pixel difference-based measures such as mean square distortion;
2. Correlation-based measures, that is, correlation of pixels, or of the vector angular
directions;
3. Edge-based measures, that is, displacement of edge positions or their consistency across
resolution levels;
4. Spectral distance-based measures, that is Fourier magnitude and/or phase spectral
discrepancy on a block basis;
5. Context-based measures, that is penalties based on various functionals of the
multidimensional context probability;
6. Human Visual System-based measures, measures either based on the HVS-weighted
spectral distortion measures or (dis)similarity criteria used in image base browsing
functions.
We define several distortion measures in each category. The specific measures are denoted
by D1, D2, ... in the pixel difference category, by C1, C2, ... in the correlation category, etc., for
ease of reference in the results and discussion sections.
The paper is organized as follows: The methodology and data sets are given in Section 2. The
descriptions of the specific measures used are relegated to the Appendix in its six
subsections. Results of the experiments and statistical analyses are presented in Section 3.
We discuss the main conclusions and the related future work in Section 4.
Objective video quality model attributes have been studied in [17], [18]. These attributes can
be directly translated to the still image quality measures as “IQM desiderata” in the
multimedia and computer vision applications:
♦ Prediction Accuracy: The accurate prediction of distortion, whether for algorithmic
performance or for subjective assessment. For example, when quality metrics are shown in
box plots as in Fig. 1, an accurate metric will exhibit a small scatter.
♦ Prediction Monotonicity: The objective image quality measure’s scores should be
monotonic in their relationship to the performance scores.
♦ Prediction Consistency: This attribute relates to the objective quality measure’s ability to
provide consistently accurate predictions for all types of images and not to fail
excessively for a subset of images.
These desired characteristics reflect on the box plots and the F scores of the quality metrics,
as detailed in the sequel.
compression type, the factor of image class etc. The output of the ANOVA is the
identification of those image quality measures that are most consistent and discriminative of
the distortion artifacts due to compression, blur and noise.
Recall that ANOVA is used to compare the means of more than two independent Gaussian
distributed groups. In our case each "compression group" consists of quality scores from
various images at a certain bit rate, and there are k = 5 groups corresponding to the 5 bit rates
tested. Each group had 30 sample vectors since there were 30 multispectral test images (10
remote sensing, 10 faces, 10 textures). Similarly 3 "blur groups" were created by low-pass
filtering the images with 2-D Gaussian-shaped filters with increasing support. Finally 3
"noise groups" were created by contaminating the images with Gaussian noise with
increasing variance ($\sigma^2 = 200, 600, 1700$). This range of noise values spans the noisy image
quality from the just noticeable distortion to annoying degradation. In a concomitant
experiment [57] images were watermarked at four different insertion strengths.
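The grouping described above can be illustrated with a minimal sketch of the F statistic. It covers only the simpler one-way case (a single factor such as bit rate), not the two-way analysis used in this study, and the function name `one_way_f` and the scores are hypothetical.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Between-group sum of squares: spread of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Hypothetical scores of one metric at three bit rates (not data from the paper).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_score = one_way_f(groups)
```

A large F value indicates that the variation between groups dominates the variation within groups, which is the property sought in a discriminating quality measure.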
Since we have two coders (i.e., JPEG and SPIHT algorithms) two-way ANOVA is
appropriate. The hypotheses for the comparison of independent groups are:
$H_0: \mu_1 = \mu_2 = \cdots = \mu_k$, i.e., the means of all the groups are equal,
both indicators of a good quality measure. In order to quantify the discriminative power of a
quality measure, we have normalized the difference of two successive group means by their
respective variances, i.e.,
$$Q_{r,r+1} = \frac{\mu_r - \mu_{r+1}}{\sigma_r \sigma_{r+1}}, \qquad Q = \mathrm{Ave}\{Q_{r,r+1}\}, \quad r = 1, \ldots, k-1 \tag{1}$$
where $\mu_r$ denotes the mean value of the image quality measure for the images compressed at
rate r and $\sigma_r$ is the standard deviation, k is the number of different bit rates at which quality
measures are calculated. A good image quality measure should have high Q value, which
implies little overlap between groups and/or large jumps between them hence high
discriminative power of the quality measure. It should be noted that the Q values and the F-
scores yielded totally parallel results in our experiments.
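As an illustration of Eq. (1), the following sketch computes Q for a set of score groups. It assumes the sample standard deviation is used for each group, and the function name `discriminative_power` and the scores are hypothetical.

```python
from statistics import mean, stdev

def discriminative_power(groups):
    """Average of Q_{r,r+1} over successive groups, following Eq. (1)."""
    mus = [mean(g) for g in groups]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigmas = [stdev(g) for g in groups]
    q_terms = [
        (mus[r] - mus[r + 1]) / (sigmas[r] * sigmas[r + 1])
        for r in range(len(groups) - 1)
    ]
    return sum(q_terms) / len(q_terms)

# Hypothetical metric scores at three successive bit rates (not data from the paper).
groups = [[9.0, 10.0, 11.0], [4.0, 5.0, 6.0], [1.0, 2.0, 3.0]]
q_value = discriminative_power(groups)
```

Groups with well-separated means and small spread yield a large magnitude of Q, in line with the box-plot interpretation.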
In Figure 1 we give box plot examples of a good, a moderate and a poor measure. For the
box plot visualization the data has been appropriately scaled without any loss of information.
Figure 1. Box plots of quality measure scores. a) Good measure, b) Moderate measure,
c) Poor measure.
There will be 5 such vectors, one for each bit rate considered. We used a total of 30 images
x 5 bit rates x 2 compressors x 26 metrics = 7800 vectors to train the SOM.
Recall that the self-organizing map (SOM) is a tool for visualization of high dimensional
data. It maps complex, non-linear high dimensional data into simple geometric relationships
on a low dimensional array and thus serves to produce abstractions. Among the important
applications of the SOM one can cite the visualization of high dimensional data, as the case
in point, and discovery of categories and abstractions from raw data.
Let the data vectors be denoted as $X = [x_1, \ldots, x_M]^T \in \mathbb{R}^M$, where M is the number of images
considered (M = 30 in our case). With each element in the SOM array, a parametric real
vector $m_i = [\mu_{i1}, \ldots, \mu_{iM}]^T \in \mathbb{R}^M$ is associated. The location of an input vector X on the
SOM array is defined by the decoder function $d(X, m_i)$, where $d(\cdot,\cdot)$ is a general distance
measure. The location of the input vector will have the array index c defined as
$c = \arg\min_i d(X, m_i)$. A critical part of the algorithm is to define the $m_i$ in such a way that
the mapping is ordered and descriptive of the distribution of X. Finding such a set of values that
minimize the distance measure resembles the standard VQ problem. In contrast, the indexing
of these values is arbitrary, whereby the mapping is unordered. However, if the minimization
of the objective functional based on the distance function is implemented under the
conditions described in [29], then one can obtain ordered values of $m_i$, almost as if the $m_i$
were lying at the nodes of an elastic net. With the elastic net analogy in mind, the SOM
algorithm can be constructed as
$$m_i(t+1) = m_i(t) + \alpha(t)\,[X(t) - m_i(t)]$$
where α( t ) is a small scalar, if the distance between units c and i in the array is smaller than
or equal to a specified limit (radius), and α( t ) = 0 otherwise. During the course of ordering
process, α (t ) is decreased from 0.05 to 0.02, while radius of neighborhood is decreased from
10 to 3. Furthermore scores are normalized with respect to the range.
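A single iteration of the update rule above can be sketched as follows. For brevity the sketch assumes a one-dimensional array of units and a Euclidean distance for d(.,.); the function name `som_step` and the numerical values are hypothetical and are not the settings of the experiments.

```python
import math

def som_step(weights, x, alpha, radius):
    """Apply one SOM update in place and return the winning index c."""
    # Winner: the unit whose weight vector is closest to the input vector.
    distances = [math.dist(x, m) for m in weights]
    c = distances.index(min(distances))
    for i, m in enumerate(weights):
        # Only units within the neighborhood radius of the winner are moved.
        if abs(i - c) <= radius:
            weights[i] = [mj + alpha * (xj - mj) for xj, mj in zip(x, m)]
    return c

# Hypothetical 1-D map with four units and 2-D weight vectors.
weights = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
winner = som_step(weights, [1.2, 0.8], alpha=0.05, radius=1)
```

In a full training run the step would be repeated over all input vectors while alpha and the radius are gradually decreased.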
The component planes j of the SOM, i.e., the arrays of scalar values $\mu_{ij}$ representing the j'th
components of the weight vectors $m_i$ and having the same format as the SOM array, are
displayed as shades of gray.
metric. Similarly the metric that responds adequately to all sorts of distortion effects is
denoted as the “global metric”. One can notice that:
♦ The fundamental metrics for JPEG compression are H2, H1, S2, E2, that is, HVS L2
norm, HVS absolute norm, spectral phase-magnitude, and edge stability measures. These
measures are listed in decreasing order of F-score.
♦ The fundamental metrics for SPIHT compression are E2, S2, S5, and H2, that is, edge
stability, spectral phase-magnitude, block spectral phase-magnitude, and HVS L2 norm.
♦ The fundamental metrics for the BLUR effect are S1, E2, S2, H1, that is, spectral phase,
edge stability, spectral phase-magnitude, and HVS absolute norm. Notice the similarity
of metrics between SPIHT and blur. This is due to the fact that we primarily encounter
blur artifacts in wavelet-based compression.
♦ The fundamental metric for the NOISE effect is, as expected, D1, the mean square error.
♦ Finally the image quality metrics that are sensitive to all distortion artifacts are, in rank
order, E2, H1, S2, H2, S5, that is, edge stability, HVS absolute norm, spectral
phase-magnitude, HVS L2 norm, block spectral phase-magnitude. To establish the global
metrics, we gave rank numbers from 1 to 26 to each metric under the four types of
distortion as in Table 2. For example for JPEG the metrics are ordered as H2, H1, S2,
E2, etc., if we take into consideration their F-scores. Then we summed their rank
numbers, and the metrics for which the sum of the ranks was the smallest were declared
the global metrics, that is, the ones that qualify well in all discrimination tests. These
results must still be taken with some caution as, for example, none of the 5 winner metrics
is as sensitive to additive noise as the D1 and D2 scores.
♦ The metrics that were the least sensitive to image set variation are D4, H3, C4, C5, D6
etc. It can be observed that these metrics in general also show poor performance in
discriminating distortion effects. On the other hand, for the distortion sensitive metrics,
even though their image set dependence is higher than that of the so-called “image independent”
metrics, more of the score variability is due to distortion than to image set change. This
can be observed from the higher F-scores for distortion effects as compared to image
set related F-scores.
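The rank-summing procedure used to establish the global metrics can be sketched as below. The function name `rank_sum_order`, the metric labels M1 to M3 and the F-scores are hypothetical placeholders, not values from Table 2, and ties are not given any special treatment.

```python
def rank_sum_order(f_scores):
    """Order metrics by the sum of their ranks over all distortion types.

    f_scores maps a distortion name to a {metric: F-score} dictionary.
    Rank 1 goes to the highest F-score within each distortion type.
    """
    totals = {}
    for scores in f_scores.values():
        ordered = sorted(scores, key=scores.get, reverse=True)
        for rank, metric in enumerate(ordered, start=1):
            totals[metric] = totals.get(metric, 0) + rank
    # The smallest summed rank comes first.
    return sorted(totals, key=totals.get)

# Hypothetical F-scores for three metrics under three distortion types.
f_scores = {
    "jpeg": {"M1": 10.0, "M2": 8.0, "M3": 1.0},
    "blur": {"M1": 5.0, "M2": 9.0, "M3": 2.0},
    "noise": {"M1": 7.0, "M2": 3.0, "M3": 4.0},
}
global_order = rank_sum_order(f_scores)
```

A metric that ranks near the top for every distortion type receives a small total and therefore appears first in the returned list.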
These observations are summarized in Table 3 where one-way results are given for each
image class (Fabrics, Faces, Remote Sensing) separately, and two-way ANOVA results are
presented for the combined set. In the two bottom rows of Table 3 the metrics that are least
sensitive to the coder type and to the image set are given. The criteria for omitting and
entering the metrics into Table 3 were the outcome of the F scores.
We also investigated the metrics with respect to their ability to respond to bit rate and coder
type. For this analysis the scores of the JPEG and SPIHT compressors were combined. It was
observed in Table 4 that:
♦ The metrics that were best in discriminating compression distortion as parameterized by
the bit rate, whatever the coder type, that is JPEG or SPIHT, were H2, H1, S2, S5 (HVS
L2 norm, HVS absolute norm, spectral phase-magnitude, block spectral phase-magnitude,
etc.).
♦ The metrics that were capable of discriminating the coder type (JPEG versus SPIHT)
were similar in the sense that they all belong to the human vision system inspired types,
namely, D6, H2, H4 and H1 (Multiresolution error, HVS L2 norm, DCTune, HVS L1
norm).
♦ Finally the metrics that were most sensitive to distortion artifacts, but at the same time,
least sensitive to image set variation were C5, D1, D3, S3, D2, C4, ... (Mean angle-magnitude
similarity, Mean square error, Modified infinity norm, Block spectral
magnitude error, Mean absolute error, Mean angle similarity, ...). These metrics were
identified by summing the two rank scores of the metrics, the first being the rank in
ascending order of distortion sensitivity, the second being the rank in descending order of
image set sensitivity. Interestingly enough almost all of them are related to the mean square
error varieties. This may explain why mean square error or signal-to-noise ratio measures,
despite the many criticisms leveled at them, have proven so resilient over time. Again this
conclusion should be accepted with some caution. For example common experience indicates that
MSE measures do not necessarily reflect all the objectionable coding artifacts, especially
at low bit rates.
Table 2: ANOVA results (F-scores) for the JPEG and SPIHT compression distortions as well
as additive noise and blur artifacts. For each distortion type the variation due to image set is
also established. For compression the degrees of freedom are 4 (bit rate) and 2 (image class)
while they are both 2 for the blur and noise experiments.
Table 3. One-way ANOVA results for each image class and two-way ANOVA results for
the distortions on the combined and image set independence.
Table 4. ANOVA results for the effect of bit rate (pooled data from JPEG and SPIHT), and
of the coder type. The degrees of freedom are 4 (bit rate) and 1 (coder type).
As expected the metrics that are responsive to distortions are also almost always responsive
to the image set. Conversely the metrics that do not respond to the image set variation are
also not very discriminating with respect to the distortion types. The fact that the metrics are
sensitive, as should be expected, to both the image content and distortion artifacts does not
eclipse their potential as quality metrics. Indeed when the metrics were tested within more
homogeneous image sets (that is only within Face images or Remote Sensing images etc.)
the same high-performance metrics scored consistently higher. Furthermore when one
compares the F-scores of the metrics with respect to bit rate variation and image set variation,
even though there is a non-negligible interaction factor, the F-score due to bit rate is always
larger than the F-score due to Image sets.
Our second investigation was on the mutual relationship between measures. It is obvious that
the quality measures must be correlated with each other as most of them must respond to
compression artifacts in similar ways. On the other hand one can conjecture that some
measures must be more sensitive to blurring effects, while others respond to blocking effects,
while still some others reflect additive noise.
Self Organizing Map (SOM) [29] is a pictorial method to display similarities and differences
between statistical variables, such as quality measures. We have therefore obtained spatial
organization of these measures via Kohonen’s self-organizing map algorithm. The inputs to
the SOM algorithm were vectors whose elements are the scores of a measure resulting from
different images. More explicitly, consider one of the measures, let’s say, D1, and a certain
compression algorithm, e.g., JPEG. The instances of this vector will be 60-dimensional, one
for each of the images in the set. The first 30 components consist of 30 images compressed
with JPEG, the next 30 juxtaposed components of the same images compressed with SPIHT.
Furthermore there will be five such vectors, one for each one of the bit rates.
The SOM organization of the measures in the 2-D space for pooled data from JPEG and
SPIHT coders is shown in Figure 2. These maps are useful for the visual assessment of
possible correlation present in the measures. One would expect that measures with similar
trends and which respond in similar ways to artifacts would cluster together spatially. The
main conclusions from the observation of the SOM and the correlation matrix are as follows:
♦ Clustering tendency of pixel difference based measures (D1, D2, D4, D5) and spectral
magnitude based method (S3) is obvious in the center portion of the map, a reflection of
the Parseval relationship. However notice that spectral phase-magnitude measures (S2,
S5) stay distinctly apart from these measures. In a similar vein purely spectral phase
measures also form a separate cluster.
♦ The human visual system based measures (H2, H3, H4), multiresolution pixel-difference
measure (D6), E2 (edge stability measure) and C5 (mean angle-magnitude measure) are
clustered in the right side of the map. The correlation of the multiresolution distance
measure, D6, with HVS based measures (H2, H3, H4) is not surprising since the idea
behind this measure is to mimic image comparison by eye more closely, by assigning
larger weight to low resolution components and less to the detailed high frequency
components.
♦ The three correlation based measures (C1, C2, C3) are together in the lower part of the
map while the two spectral phase error measures (S2, S5) are concentrated separately in
the upper part of the map.
♦ It is interesting to note that all the context-based measures (Z1, Z2, Z3, Z4) are grouped
in the upper left region of the map together with H1 (HVS filtered absolute error).
♦ The proximity of the Pratt measure (E1) and the maximum difference measures (D3) is
meaningful, since the maximum distortions in reconstructed images are near the edges.
The constrained maximum distance or sorted maximum distance measures can be used in
codec designs to preserve the two dimensional features, such as edges, in reconstructed
images.
In conclusion the relative positioning of measures in the two-dimensional map was in
agreement with one’s intuitive grouping and with the ANOVA results. We would like to
emphasize here that in the above SOM discussions it is only the relative position of the
measures that is significant, while their absolute positioning is arbitrary. Furthermore the
metrics that behave in an uncorrelated way in the SOM display are conjectured to respond to
different distortion artifacts, and this is used as an additional criterion for the selection
of a subset of “good” measures.
mark different features of the image, such as global DFT coefficients, block DCT
coefficients, pixels directly etc.
We performed ANOVA tests for several watermarking and steganography algorithms.
For example the most discriminating IQMs for the pooled steganography and watermarking
algorithms were found as: Mean Absolute Error D2, Mean Square Error D1, Czekanowski
Correlation Measure C3, Angle Mean C4, Spectral Magnitude Distance S2, Median Block
Spectral Phase Distance S4, Median Block Weighted Spectral Distance S5, Normalized Mean
Square HVS Error H2. The implication here is two-fold: One is that, using these features a
steganalyzer can be designed to detect the watermarked or stegoed images using multivariate
regression analysis, as we show in [57, 58, 59]. This linear combination of the IQMs for
steganalysis purposes is denoted as the “supermetric” for steganalysis. It is shown in [57] that
the steganalysis supermetric can detect the presence of watermarking with 85% accuracy and
can even predict whose watermark it is [58]. The other implication is that, current
watermarking or steganographic algorithms should exercise more care on those statistically
significant image features to eschew detection [59].
For the second “supermetric” we searched for correlation between the subjective
opinions and an objective measure derived from a combination of our IQMs. The subjective
image quality experiment was conducted with a group of 17 subjects (students that took a
first course in image processing) who noted their image quality opinion scores in the 1-5
range, 1 meaning that no distortion could be observed and 5 meaning very annoying quality. Time of
observation was unlimited. The images used were all 512 x 512 RGB color images from the
Purdue University face database, and were viewed at a distance of 4x the image height. The results
reported are based on the 850 quality evaluations of 50 encoded images (10 images
compressed with JPEG at five different quality scales, Q=10, 30, 50, 70, 90) by the pool of 17
subjects. The supermetric of image quality for compression artifacts was built using the
global metrics E2, H1, S2, H2, S5 (that is, edge stability, HVS absolute norm, spectral
phase-magnitude, HVS L2 norm, block spectral phase-magnitude) for the image distortions due to
compression. The supermetric was built by regressing them against the MOS scores. The
plot of this supermetric and MOS data is given in Fig. 4, where a high value of the correlation
coefficient has been determined: 0.987. The correlation coefficients of the individual metrics,
shown in Table 5, were all lower, as expected.
Table 5: Image quality metrics and their correlation coefficients with MOS data.
Metric  Corr.   Metric  Corr.   Metric  Corr.   Metric  Corr.   Metric        Corr.
D1      0.893   C1      0.501   E2      0.890   Z1      0.502   H3            0.936
D2      0.895   C2      0.810   S1      0.929   Z2      0.543   H4            0.982
D3      0.720   C3      0.926   S2      0.903   Z3      0.609   Super-metric  0.987
D4      0.901   C4      0.912   S3      0.930   Z4      0.517
D5      0.381   C5      0.917   S4      0.883   H1      0.890
D6      0.904   E1      0.833   S5      0.865   H2      0.938
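Correlation coefficients of the kind listed in Table 5 can be computed as in the following sketch. It assumes the Pearson correlation coefficient, which the text does not state explicitly; the function name `pearson` and the scores are hypothetical and do not reproduce the experimental data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical metric scores and mean opinion scores for four images.
metric_scores = [0.1, 0.2, 0.3, 0.4]
mos_scores = [1.0, 2.0, 3.0, 4.0]
r = pearson(metric_scores, mos_scores)
```

The same function could be applied to the output of a regression-based combination of metrics in order to compare it against the individual measures.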
4. Conclusions
In this work we have presented collectively a comprehensive set of image quality measures
and categorized them. Using statistical tools we were able to classify more than two dozen
measures based on their sensitivity to different types of distortions.
Statistical investigation of 26 different measures using ANOVA analyses has revealed that
local phase-magnitude measures (S2 or S5), HVS-filtered L1 and L2 norms (H1, H2), edge
stability measure (E2) are most sensitive to coding and blur artifacts, while the mean square
error (D1) remains as the best measure for additive noise. These “winner” metrics were
elected on the basis of the summed rank scores across four artifacts: JPEG compression,
SPIHT compression, blur and noise. This preselection of the E2, S2, S5, H1, H2 subset was
based, on the one hand, on their superior F-scores, and on the other hand, on the fact they
appeared to behave in an uncorrelated way in their SOM maps.
These metrics satisfied, in their category of distortion, the IQM desiderata stated in Section
2.1, namely accuracy, monotonicity and consistency. The H1, H2, S2, S5 and D1 metrics
were accurate in that they responded more strongly to the stated type of distortion than to
any other factor. They responded monotonically to the level of distortion, that is, the metric
versus distortion parameter plotted monotonically (graph not shown). Finally their
consistencies were shown when they were tested on widely differing image classes (faces,
textures, remote sensing).
Ideally speaking one would like to have a quality measure that is able to give accurate results
for different performance levels of a given compression scheme, and across different
compression schemes. It appears that, as shown in Section 3.3, a combination of spectral
phase-and-magnitude measure and of HVS-filtered error norm comes closest to satisfying such a
requirement, as it is adequately sensitive to a variety of artifacts. The Kohonen map of the
measures has been useful in depicting measures that behave similarly or in an uncorrelated
way. The correlation between various measures has been put into evidence via Kohonen’s
Self-Organizing Map.
In conclusion, the subset of the E2, S2, S5, H1, H2 metrics are the prominent image quality
measures as shown by the ANOVA analysis on the one hand, and by the good regression
property to MOS data, on the other hand. The implication is that more attention should be
paid to the spectral phase and HVS-filtered quality metrics in the design of coding algorithms
and sensor evaluation. We have also shown the validity of the ANOVA methodology in an
alternate application, that is when we applied it to the selection of IQMs for the construction
of a steganalyzer.
Future work will address the extension of the subjective experiments. Note that we have only
shown in one experiment that the selected IQMs regress well on the mean opinion scores.
However this experiment must be repeated on yet unseen data to understand how well it
predicts subjective opinion. In a similar vein the database for detection experiments is to be
extended to cover a larger variety of watermarking and steganography tools.
APPENDIX A
In this Appendix we define and describe the multitude of image quality measures considered.
In these definitions the pixel lattices of images A, B will be referred to as A(i, j ) and B(i, j ) ,
i, j = 1...N, as the lattices are assumed to have dimensions NxN. The pixels can take values
from the set {0,..., G} in any spectral band. The actual color images we considered had G =
255 in each band. Similarly we will denote the multispectral components of an image at the
pixel position i, j, and in band k as Ck (i, j ) , where k = 1,..., K . The boldface symbols C(i, j )
and Ĉ(i, j ) will indicate the multispectral pixel vectors at position (i,j). For example for the
color images in the RGB representation one has $\mathbf{C}(i,j) = [R(i,j)\; G(i,j)\; B(i,j)]^T$. All these
definitions are summarized in the following table:
Thus for example the power in the k'th band can be calculated as
$\sigma_k^2 = \sum_{i,j=0}^{N-1} C_k(i,j)^2$. All these quantities with an additional hat, i.e.,
$\hat{C}_k(i,j)$, $\hat{\mathbf{C}}$ etc., will correspond to the distorted versions of the same
original image. As a case in point, the expression
$\left\|\mathbf{C}(i,j) - \hat{\mathbf{C}}(i,j)\right\|^2 = \sum_{k=1}^{K} \left[C_k(i,j) - \hat{C}_k(i,j)\right]^2$
will denote the sum of errors in the spectral components at a given pixel position i, j.
Similarly the error expression in the last row of the above table expands as
$\varepsilon_k^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} \left[C_k(i,j) - \hat{C}_k(i,j)\right]^2$.
In the specific case of RGB color images we will occasionally revert to the notations
$\{R, G, B\}$ and $\{\hat{R}, \hat{G}, \hat{B}\}$.
The Lγ norm of the dissimilarity of two images can be calculated by taking the Minkowsky
average of the pixel differences spatially and then chromatically (that is over the bands):
1/ γ
1 K 1 N −1 γ
ε = ∑ 2
γ
∑ C k (i, j) − Ĉ k (i, j) (A1)
K k =1 N i , j= 0
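Alternately the Minkowski average can be first carried over the bands and then spatially, as
in the following expression:
$$\varepsilon^{\gamma} = \left\{ \frac{1}{N^2} \sum_{i,j=0}^{N-1} \left[ \frac{1}{K} \sum_{k=1}^{K} \left| C_k(i,j) - \hat{C}_k(i,j) \right| \right]^{\gamma} \right\}^{1/\gamma} .$$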
In what follows we have used the pixel-wise difference in the Minkowski sum as given in
Eq. (A1). For γ = 2, one obtains the well-known Mean Square Error (MSE) expression,
denoted as D1:
$$D1 = \frac{1}{K} \frac{1}{N^2} \sum_{i,j=0}^{N-1} \left\| \mathbf{C}(i,j) - \hat{\mathbf{C}}(i,j) \right\|^2 = \frac{1}{K} \sum_{k=1}^{K} \varepsilon_k^2 . \qquad \text{(A2)}$$
An overwhelming number of quality results in the literature are in fact given in terms of the
Signal to Noise Ratio (SNR) or the Peak SNR (PSNR), which are obtained, respectively, by
dividing the image power by D1, and by dividing the peak power $G^2$ by D1. Though the
SNR and the PSNR are very frequently used in quantifying coding distortions, their
shortcomings have been pointed out in various studies [13]. However, despite these oft-cited
criticisms of the MSE-based quality measures, there has been a recent resurgence of the
SNR/PSNR metrics [17, 18]. For example, studies of the Video Quality Experts Group
(VQEG) [17] have shown that the PSNR measure is a very good indicator of subjective
preference in video coding.
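The Minkowski average of Eq. (A1), the MSE (D1) and the PSNR can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code used in the study: the function names, the (N, N, K) array layout and the decibel scaling of the PSNR are assumptions made here.

```python
import numpy as np

def minkowski_error(c, c_hat, gamma=2.0):
    """Eq. (A1): Minkowski average over pixels in each band, then mean over bands.
    c, c_hat: arrays of shape (N, N, K) (layout is an assumption of this sketch)."""
    diff = np.abs(c.astype(np.float64) - c_hat.astype(np.float64))
    per_band = np.mean(diff ** gamma, axis=(0, 1)) ** (1.0 / gamma)
    return float(np.mean(per_band))

def mse(c, c_hat):
    """D1: squared error averaged over all pixels and bands."""
    diff = c.astype(np.float64) - c_hat.astype(np.float64)
    return float(np.mean(diff ** 2))

def psnr(c, c_hat, peak=255.0):
    """Peak power G^2 divided by D1, expressed in dB (the dB scaling is assumed)."""
    d1 = mse(c, c_hat)
    if d1 == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / d1))
```

For a constant error of 5 gray levels in every band, `mse` returns 25 and `minkowski_error` returns 5 for both γ = 1 and γ = 2.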
For γ = 1 one obtains the absolute difference, denoted as D2. For the power γ = ∞ in the
Minkowski average, the maximum difference measure
$$\varepsilon_{\infty} = \max_{i,j} \sum_{k=1}^{K} \frac{1}{K} \left| C_k(i,j) - \hat{C}_k(i,j) \right| = \max_{i,j} \left\| \mathbf{C}(i,j) - \hat{\mathbf{C}}(i,j) \right\|$$
is obtained. Recall that in signal and image processing the maximum difference or the infinity
norm is very commonly used [6]. However, given the noise-prone nature of the maximum
difference, this metric can be made more robust by considering the ranked list of pixel
differences $\Delta_l(\mathbf{C} - \hat{\mathbf{C}})$, $l = 1 \ldots N^2$, resulting in a modified Minkowski infinity metric,
called D3. Here $\Delta_l(\mathbf{C} - \hat{\mathbf{C}})$ denotes the l-th largest deviation among all pixels [31]. Thus
$\Delta_1(\mathbf{C} - \hat{\mathbf{C}})$ is simply the error expression $\varepsilon_{\infty}$ above. Similarly $\Delta_2$ corresponds to the second
largest term, etc. Finally a modified maximum difference measure using the first r of the $\Delta_m$
terms can be constructed by computing the root mean square value of the ranked largest
differences, $\Delta_1 \ldots \Delta_r$.
$$D3 = \sqrt{ \frac{1}{r} \sum_{m=1}^{r} \Delta_m^2 \left( \mathbf{C} - \hat{\mathbf{C}} \right) } \qquad \text{(A3)}$$
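A minimal sketch of the modified infinity norm D3 follows. It assumes that the per-pixel deviation is the Euclidean norm of the difference vector across bands and that r = 10 is a reasonable default; both choices, like the function name, are assumptions of this sketch rather than values stated in the text.

```python
import numpy as np

def modified_infinity_norm(c, c_hat, r=10):
    """D3: root mean square of the r largest per-pixel deviations.
    c, c_hat: arrays of shape (N, N, K)."""
    diff = c.astype(np.float64) - c_hat.astype(np.float64)
    # Per-pixel deviation: Euclidean norm across the K bands (assumed here).
    dev = np.sqrt(np.sum(diff ** 2, axis=-1)).ravel()
    r = min(r, dev.size)
    largest = np.sort(dev)[-r:]          # the r largest deviations
    return float(np.sqrt(np.mean(largest ** 2)))
```

With r = 1 the function returns the plain maximum difference, so D3 can be read as a smoothed version of the infinity norm.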
The choice of the color space for an image similarity metric is important, because the color
space must be uniform, that is, the numerical difference between two colors must be consistent
with the color difference estimated by a human observer. Since the RGB color space is not
well suited for this task, two color spaces are defined: the 1976 CIE L*u*v* and the 1976 CIE
L*a*b* color spaces [32]. One recommended color-difference equation for the L*a*b* color
space is given by the Euclidean distance [33]. Let
$$\Delta L^*(i,j) = L^*(i,j) - \hat{L}^*(i,j)$$
$$\Delta a^*(i,j) = a^*(i,j) - \hat{a}^*(i,j)$$
$$\Delta b^*(i,j) = b^*(i,j) - \hat{b}^*(i,j)$$
denote the color component differences in L*a*b* space. Then the Euclidean distance is:
$$D4 = \frac{1}{N^2} \sum_{i,j=0}^{N-1} \left[ \Delta L^*(i,j)^2 + \Delta a^*(i,j)^2 + \Delta b^*(i,j)^2 \right] . \qquad \text{(A4)}$$
Note that (A4) is intended to yield perceptually uniform spacing of colors that exhibit color
differences greater than the JND (just-noticeable difference) threshold but smaller than those
in the Munsell book of color [33]. This measure obviously applies to color images only and
cannot be generalized to arbitrary multispectral images. Therefore it has been used only for
the face images and texture images, and not for the satellite images.
Image distortion on a pixel level can arise from differences in the gray level of the pixels
and/or from the displacements of the pixel. A distortion measure that penalizes in a
graduated way spatial displacements in addition to gray level differences, and that allows
therefore some tolerance for pixel shifts can be defined as follows [34], [35]:
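$$D5 = \frac{1}{2(N-w)^2} \sum_{i,j=w/2}^{N-w/2} \left\{ \left[ \min_{l,m \in w_{i,j}} d\left( \mathbf{C}(i,j), \hat{\mathbf{C}}(l,m) \right) \right]^2 + \left[ \min_{l,m \in w_{i,j}} d\left( \hat{\mathbf{C}}(i,j), \mathbf{C}(l,m) \right) \right]^2 \right\} \qquad \text{(A5)}$$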
where d (⋅ , ⋅) is some appropriate distance metric. Notice that for w=1 this metric reduces to
the mean square error as in D1.
Thus for any given pixel $\mathbf{C}(i,j)$, we search for a best matching pixel in the d-distance sense
in the w×w neighborhood of the pixel $\hat{\mathbf{C}}(i,j)$, denoted as $\hat{\mathbf{C}}_w(i,j)$. The size of the
neighborhood is typically small, e.g., 3×3 or 5×5, and one can consider a square or a
cross-shaped support. Similarly one calculates the distance from $\hat{\mathbf{C}}(i,j)$ to $\mathbf{C}_w(i,j)$, where again
$\mathbf{C}_w(i,j)$ denotes the pixels in the w×w neighborhood of coordinates (i,j) of $\mathbf{C}(i,j)$. Note that
in general $d\left( \mathbf{C}(i,j), \hat{\mathbf{C}}_w(i,j) \right)$ is not equal to $d\left( \hat{\mathbf{C}}(i,j), \mathbf{C}_w(i,j) \right)$. As for the distance measure
$d(\cdot,\cdot)$, the city block metric or the chessboard metric can be used. For example, the city
block metric becomes
$$d_{city}\left( \mathbf{C}(i,j), \hat{\mathbf{C}}(l,m) \right) = \frac{|i-l| + |j-m|}{N} + \frac{\left\| \mathbf{C}(i,j) - \hat{\mathbf{C}}(l,m) \right\|}{G}$$
where ||.|| denotes the norm of the difference between C(i, j) and Ĉ(i, j) vectors. Thus both
the pixel color difference and search displacement are considered. In this expression N and G
are one possible set of normalization factors to tune the penalties due to pixel shifts and pixel
spectral differences, respectively. In our measurements we have used the city block distance
with 3x3 neighborhood size.
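The neighborhood error with the city block distance can be sketched as follows. This is a hedged illustration: the function name is hypothetical, only interior pixels are visited, and the sum is normalized by the actual number of interior pixels, which differs slightly from the $(N-w)^2$ factor of (A5).

```python
import numpy as np

def neighborhood_error(c, c_hat, w=3, g=255.0):
    """Sketch of D5 with the city block distance.
    c, c_hat: arrays of shape (N, N, K); w: odd neighborhood width."""
    c = c.astype(np.float64)
    c_hat = c_hat.astype(np.float64)
    n = c.shape[0]
    h = w // 2
    core = (slice(h, n - h), slice(h, n - h))   # interior pixels only

    def min_dist(a, b):
        # For each interior pixel of a, the smallest d_city to any pixel
        # of b inside the w x w window centred on the same position.
        best = np.full((n - 2 * h, n - 2 * h), np.inf)
        for dl in range(-h, h + 1):
            for dm in range(-h, h + 1):
                shifted = b[h + dl:n - h + dl, h + dm:n - h + dm]
                colour = np.sqrt(np.sum((a[core] - shifted) ** 2, axis=-1)) / g
                shift = (abs(dl) + abs(dm)) / n
                best = np.minimum(best, shift + colour)
        return best

    d_ab = min_dist(c, c_hat)
    d_ba = min_dist(c_hat, c)
    # Mean over interior pixels (an assumption replacing the (N - w)^2 factor).
    return float(np.mean(d_ab ** 2 + d_ba ** 2) / 2.0)
```

A one-pixel shift of the whole image costs at most $1/N^2$ under this sketch, whereas the MSE of the same pair can be large; this is the tolerance to pixel shifts that the measure is designed to provide.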
One limitation of standard objective measures of distance between images is that the
comparison is conducted at the full image resolution. Alternative measures can be defined
that resemble image perception in the human visual system more closely, by assigning larger
weights to low resolutions and smaller weights to the detail image [36]. Such measures are
also more realistic in machine vision tasks that often use local information only.
Consider the various levels of resolution denoted by $r \geq 1$. For each value of r the image is
split into blocks $b_1$ to $b_n$, where n depends on the scale r. For example for r = 1, at the lowest
resolution, only one block covers the whole image, characterized by its average gray level g.
For r = 2 one has four blocks, each of size $\left( \frac{N}{2} \times \frac{N}{2} \right)$, with average gray levels $g_{11}$, $g_{12}$, $g_{21}$ and
$g_{22}$. For the r-th resolution level one would then have $2^{2r-2}$ blocks of size $\left( \frac{N}{2^{r-1}} \times \frac{N}{2^{r-1}} \right)$,
characterized by the block average gray levels $g_{ij}$, $i, j = 1, \ldots, 2^{r-1}$. Thus for each block $b_{ij}$ of
the image $\mathbf{C}$, take $g_{ij}$ as its average gray level and $\hat{g}_{ij}$ to correspond to its counterpart in the
image $\hat{\mathbf{C}}$ (for simplicity a third index denoting the resolution level has been omitted). The
average difference in gray level at the resolution r has weight $\frac{1}{2^r}$. Therefore the distortion at
this level is
$$d_r = \frac{1}{2^r} \frac{1}{2^{2r-2}} \sum_{i,j=1}^{2^{r-1}} \left| g_{ij} - \hat{g}_{ij} \right|$$
where $2^{r-1}$ is the number of blocks along either the i or the j index. If one considers a total of
R resolution levels, then a distance function can simply be found by summing over all
resolution levels, r = 1,..., R, that is $D(\mathbf{C}, \hat{\mathbf{C}}) = \sum_{r=1}^{R} d_r$. The actual value of R (the number of
resolution levels) will be set by the initial resolution of the digital image. For example, for a
512×512 image one has R = 9. Finally, for multispectral images one can extend this
definition in two ways. In the straightforward extension, one sums the multiresolution
distances $d_r^k$ over the bands:
$$D6 = \frac{1}{K} \sum_{k=1}^{K} \sum_{r=1}^{R} d_r^k \qquad \text{(A6)}$$
where $d_r^k$ is the multiresolution distance in the k-th band. This is the multiresolution distance
definition that we used in the experiments. Alternatively the Burt pyramid was constructed to
obtain the multiresolution representation. However in the tests it did not perform as well as
the pyramid described in [36].
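In the second extension, the block averages of the color components are compared jointly:
$$D(\mathbf{C}, \hat{\mathbf{C}}) = \sum_{r=1}^{R} d'_r \quad \text{with} \quad d'_r = \frac{1}{2^r} \frac{1}{2^{2r-2}} \sum_{i,j=1}^{2^{r-1}} \left[ \left( g_{ij}^R - \hat{g}_{ij}^R \right)^2 + \left( g_{ij}^G - \hat{g}_{ij}^G \right)^2 + \left( g_{ij}^B - \hat{g}_{ij}^B \right)^2 \right]^{1/2}$$
where, for example, $g_{ij}^R$ is the average gray level of the ij'th block in the "red" component of
the image at the (implicit) resolution level r. Notice that in the latter equation the Euclidean
norm of the differences of the block average color components R, G, B has been utilized.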
Notice that the last two measures, that is, the neighborhood distance measure and the
multiresolution distance measure have not been previously used in evaluating compressed
images.
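The multiresolution distance D6 of (A6) can be sketched with block averages computed by reshaping. The sketch assumes square images whose side N is a power of two and takes R = log2(N), which gives R = 9 for a 512×512 image as stated above; the function names are hypothetical.

```python
import numpy as np

def multires_distance_band(a, b):
    """Sum over r = 1..R of d_r for one band; a, b are (N, N) arrays, N a power of two."""
    n = a.shape[0]
    assert a.shape == (n, n) and n & (n - 1) == 0, "N must be a power of two"
    levels = n.bit_length() - 1              # R = log2(N)
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    total = 0.0
    for r in range(1, levels + 1):
        blocks = 2 ** (r - 1)                # blocks per side at level r
        size = n // blocks                   # block side length
        ga = a.reshape(blocks, size, blocks, size).mean(axis=(1, 3))
        gb = b.reshape(blocks, size, blocks, size).mean(axis=(1, 3))
        total += np.sum(np.abs(ga - gb)) / (2 ** r * 2 ** (2 * r - 2))
    return float(total)

def multires_distance(c, c_hat):
    """D6: average of the per-band multiresolution distances; c, c_hat are (N, N, K)."""
    k = c.shape[-1]
    return sum(multires_distance_band(c[..., i], c_hat[..., i]) for i in range(k)) / k
```

For a constant gray level offset δ every level contributes δ/2^r, so the distance equals δ(1 − 2^{-R}); coarse levels dominate, which is the intended weighting.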
B. Correlation-Based Measures
Structural Content:
$$C1 = \frac{1}{K} \sum_{k=1}^{K} \frac{ \sum_{i,j=0}^{N-1} C_k(i,j)^2 }{ \sum_{i,j=0}^{N-1} \hat{C}_k(i,j)^2 } , \qquad \text{(A7)}$$
Normalized Cross-Correlation measure:
$$C2 = \frac{1}{K} \sum_{k=1}^{K} \frac{ \sum_{i,j=0}^{N-1} C_k(i,j)\, \hat{C}_k(i,j) }{ \sum_{i,j=0}^{N-1} C_k(i,j)^2 } , \qquad \text{(A8)}$$
Czekanowski Distance:
A metric useful to compare vectors with strictly non-negative components, as in the case of
color images, is given by the Czekanowski distance [37]:
$$C3 = \frac{1}{N^2} \sum_{i,j=0}^{N-1} \left( 1 - \frac{ 2 \sum_{k=1}^{K} \min\left( C_k(i,j), \hat{C}_k(i,j) \right) }{ \sum_{k=1}^{K} \left( C_k(i,j) + \hat{C}_k(i,j) \right) } \right) . \qquad \text{(A9)}$$
The Czekanowski coefficient [38] (also called the percentage similarity) measures the
similarity between different samples, communities, and quadrats.
Obviously as the difference between two images tends to zero ε = C − Ĉ → 0 , all the
correlation-based measures tend to 1, while as ε 2 → G 2 they tend to 0. Recall also that
distance measures and correlation measures are complementary, so that under certain
conditions, minimizing distance measures is tantamount to maximizing the correlation
measure [39].
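The Czekanowski distance of (A9) is straightforward to evaluate per pixel. In the sketch below the function name is hypothetical, and pixels that are zero in every band of both images are treated as identical, which is an assumption made here to avoid a division by zero.

```python
import numpy as np

def czekanowski_distance(c, c_hat, eps=1e-12):
    """C3 for non-negative arrays c, c_hat of shape (N, N, K)."""
    c = c.astype(np.float64)
    c_hat = c_hat.astype(np.float64)
    num = 2.0 * np.sum(np.minimum(c, c_hat), axis=-1)
    den = np.sum(c + c_hat, axis=-1)
    # All-zero pixel pairs are counted as a perfect match (assumed convention).
    ratio = np.where(den > 0, num / np.maximum(den, eps), 1.0)
    return float(np.mean(1.0 - ratio))
```

As an example, if every pixel is (1, 1, 1) in one image and (3, 3, 3) in the other, the per-pixel ratio is 6/12 and the distance is 0.5.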
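$$\chi_{ij} = 1 - \left[ 1 - \frac{2}{\pi} \cos^{-1} \frac{ \left\langle \mathbf{C}(i,j), \hat{\mathbf{C}}(i,j) \right\rangle }{ \left\| \mathbf{C}(i,j) \right\| \left\| \hat{\mathbf{C}}(i,j) \right\| } \right] \left[ 1 - \frac{ \left\| \mathbf{C}(i,j) - \hat{\mathbf{C}}(i,j) \right\| }{ \sqrt{3 \cdot 255^2} } \right]$$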
We can use the moments of the spectral (chromatic) vector differences as distortion
measures. To this effect we have used the mean of the angle difference (C4) and the mean of
the combined angle-magnitude difference (C5) as in the following two measures:
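$$C4 = \mu_{\chi} = 1 - \frac{1}{N^2} \sum_{i,j=1}^{N} \left( \frac{2}{\pi} \cos^{-1} \frac{ \left\langle \mathbf{C}(i,j), \hat{\mathbf{C}}(i,j) \right\rangle }{ \left\| \mathbf{C}(i,j) \right\| \left\| \hat{\mathbf{C}}(i,j) \right\| } \right) , \qquad \text{(A10)}$$
$$C5 = \frac{1}{N^2} \sum_{i,j=1}^{N} \chi_{ij} , \qquad \text{(A11)}$$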
where µ χ is the mean of the angular differences. These moments have been previously used
for the assessment of directional correlation between color vectors.
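The two angular moments can be sketched as below. Two points are assumptions of this sketch and not taken from the text: the magnitude term is normalized by the largest possible difference norm, $\sqrt{K G^2}$, and a pixel whose vector is all zeros is assigned the maximal normalized angle of 1, since its direction is undefined. The function name is hypothetical.

```python
import numpy as np

def angle_measures(c, c_hat, g=255.0, eps=1e-12):
    """Return (C4, C5) for arrays c, c_hat of shape (N, N, K)."""
    c = c.astype(np.float64)
    c_hat = c_hat.astype(np.float64)
    dot = np.sum(c * c_hat, axis=-1)
    norms = np.linalg.norm(c, axis=-1) * np.linalg.norm(c_hat, axis=-1)
    cosine = np.clip(dot / np.maximum(norms, eps), -1.0, 1.0)
    angle = (2.0 / np.pi) * np.arccos(cosine)      # normalized angle per pixel
    k = c.shape[-1]
    # Magnitude difference normalized by sqrt(K * G^2) (assumed normalization).
    mag = np.linalg.norm(c - c_hat, axis=-1) / np.sqrt(k * g ** 2)
    chi = 1.0 - (1.0 - angle) * (1.0 - mag)        # combined angle-magnitude term
    c4 = 1.0 - float(np.mean(angle))
    c5 = float(np.mean(chi))
    return c4, c5
```

For identical images the sketch gives C4 = 1 and C5 = 0 up to rounding; for pure red against pure green the pixel vectors are orthogonal, so C4 = 0 and C5 = 1.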
According to the contour-texture paradigm of images, the edges form the most informative
part of images. For example, in the perception of scene content by the human visual system,
edges play the major role. Similarly, machine vision algorithms often rely on feature maps
obtained from the edges. Thus, task performance in vision, whether by humans or machines,
is highly dependent on the quality of the edges and other two-dimensional features such as
corners [9], [41], [42]. Some examples of edge degradations are: discontinuities in the edge,
decrease of edge sharpness by smoothing effects, offset of edge position, missing edge points,
falsely detected edge points, etc. [39]. Notice however that not all of the above degradations
are necessarily observed, as edge and corner information in images is rather well preserved by
most compression algorithms.
Since we do not possess the ground-truthed edge map, we have used the edge map obtained
from the original uncompressed images as the reference. Thus to obtain edge-based quality
measures we have generated edge fields from both the original and compressed images using
the Canny detector [43]. We have not used any multiband edge detector; instead a separate
edge map from each band has been obtained. The outputs of the derivative of Gaussians of
each band are averaged, and the resulting average output is interpolated, thresholded and
thinned in a manner similar to that in [12]. The parameters are set as in [43] at
robotics.eecs.berkeley.edu/~sastry/ee20/cacode.html.
In summary, for each band k = 1...K, the horizontal and vertical gradients $G_x^k$, $G_y^k$ and their
norms $N^k = \sqrt{ (G_x^k)^2 + (G_y^k)^2 }$ are found. Their average over bands is calculated and thresholded.
A measure introduced by Pratt [39] considers both edge location accuracy and missing / false
alarm edge elements. This measure is based on the knowledge of an ideal reference edge
map, where the reference edges should preferably have a width of one pixel. The figure of
merit is defined as:
$$E1 = \frac{1}{\max\{n_d, n_t\}} \sum_{i=1}^{n_d} \frac{1}{1 + a d_i^2} \qquad \text{(A12)}$$
where nd and nt are the number of detected and ground-truth edge points, respectively, and di
is the distance to the closest edge candidate for the ith detected edge pixel. In our study the
binary edge field obtained from the uncompressed image is considered as the “ground truth”,
or the reference edge field. The factor max{nd , nt } penalizes the number of false alarm edges
or conversely missing edges.
The scaling factor a provides a relative weighting between smeared edges and thin but offset
edges. The sum terms penalize possible shifts from the correct edge positions. In summary,
the smearing and offset effects are all included in the Pratt measure, which provides an
impression of overall quality.
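Given two binary edge maps, the Pratt figure of merit of (A12) can be sketched with a brute-force nearest-neighbor search. The value a = 1/9 used as default is a commonly quoted choice and is an assumption here, as the text does not state the value used; the function name is hypothetical. The pairwise distance matrix needs memory proportional to $n_d \times n_t$, so a distance transform would be preferable for large images.

```python
import numpy as np

def pratt_fom(detected, reference, a=1.0 / 9.0):
    """E1 for two binary edge maps of equal shape.
    detected: edge map of the distorted image; reference: "ground truth" edge map."""
    det = np.argwhere(np.asarray(detected, dtype=bool))   # (n_d, 2) coordinates
    ref = np.argwhere(np.asarray(reference, dtype=bool))  # (n_t, 2) coordinates
    if len(det) == 0 or len(ref) == 0:
        return 0.0
    # Squared distance from each detected edge pixel to its closest reference pixel.
    d2 = ((det[:, None, :] - ref[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
    return float(np.sum(1.0 / (1.0 + a * d2)) / max(len(det), len(ref)))
```

A vertical edge displaced by one pixel gives $d_i^2 = 1$ for every detected point, hence E1 = 1/(1 + 1/9) = 0.9 with this choice of a.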
An edge stability map $Q(i,j)$ is obtained by considering the longest subsequence
$E(i,j,\sigma_m), \ldots, E(i,j,\sigma_{m+l-1})$ of edge images such that
$$Q(i,j) = l \quad \text{where} \quad l = \arg\max_{l} \bigcap_{\sigma_m \leq \sigma_k \leq \sigma_{m+l-1}} E(i,j,\sigma_k) = 1 .$$
The edge stability index calculated from the distorted image at pixel position i, j will be denoted
by $\hat{Q}(i,j)$. We have used five scales to obtain the edge maps of five band-pass filtered
images. Finally a fidelity measure called Edge Stability Mean Square Error (ESMSE) can be
calculated by summing the differences in edge stability indexes over all edge pixel positions,
$n_d$, that is the edge pixels of the ground-truth (undistorted) image at full resolution.
$$E2 = \frac{1}{n_d} \sum_{i,j=0}^{n_d} \left( Q(i,j) - \hat{Q}(i,j) \right)^2 \qquad \text{(A13)}$$
For multispectral images the index in (A13) can be simply averaged over the bands.
Alternatively a single edge field from multiband images [45, 46] can be obtained and the
resulting edge discrepancies measured as in Eq. (A13).
A property complementary to edge information could be the surface curvature [47], which is
a useful feature for scene analysis, feature extraction and object recognition. Estimates of
local surface types [48], based on the signs of the mean and Gaussian curvatures, have been
widely used for image segmentation and classification algorithms. If one models a gray level
image as a 3-D topological surface, then one can analyze this surface locally using
differential geometry. A measure based on the discrepancy of mean and Gaussian curvatures
between an image and its distorted version has been used in [49]. However this measure was
not pursued further due to the subjective assignment of weights to the surface types and the
fact that this measure did not perform particularly well in preliminary tests.
In this category we consider the distortion penalty functions obtained from the complex
Fourier spectrum of images [10].
Spectral distortion measures, using difference metrics as for example given in A.1-A.3 can be
extended to multispectral images. To this effect considering the phase and magnitude
spectra, that is
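$$\varphi(u,v) = \arctan\left( \Gamma(u,v) \right), \qquad M(u,v) = \left| \Gamma(u,v) \right| ,$$
the distortion occurring in the phase and magnitude spectra can be separately calculated and
weighted. Thus one can define the spectral magnitude distortion
$$S = \frac{1}{N^2} \sum_{u,v=0}^{N-1} \left| M(u,v) - \hat{M}(u,v) \right|^2 ,$$
the spectral phase distortion
$$S1 = \frac{1}{N^2} \sum_{u,v=0}^{N-1} \left| \varphi(u,v) - \hat{\varphi}(u,v) \right|^2 \qquad \text{(A14)}$$
and the weighted spectral distortion
$$S2 = \frac{1}{N^2} \left( \lambda \sum_{u,v=0}^{N-1} \left| \varphi(u,v) - \hat{\varphi}(u,v) \right|^2 + (1 - \lambda) \sum_{u,v=0}^{N-1} \left| M(u,v) - \hat{M}(u,v) \right|^2 \right) \qquad \text{(A15)}$$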
where λ is to be judiciously chosen, e.g., to reflect quality judgement. These ideas can be
extended in a straightforward manner to multiple band images, by summing over all band
distortions. In the following computations, λ is chosen so as to render the contributions of
the magnitude and phase terms commensurate, that is, $\lambda = 2.5 \times 10^{-5}$.
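The spectral magnitude, phase and weighted distortions can be sketched for a single band with NumPy's FFT. The function name is hypothetical, and the phase difference is taken naively, without unwrapping across the ±π boundary, which is an assumption of this sketch.

```python
import numpy as np

def spectral_distortions(a, b, lam=2.5e-5):
    """Return (S, S1, S2) for two single-band images a, b of shape (N, N)."""
    fa = np.fft.fft2(a.astype(np.float64))
    fb = np.fft.fft2(b.astype(np.float64))
    n2 = a.size                                   # N^2
    mag_err = np.sum((np.abs(fa) - np.abs(fb)) ** 2) / n2
    # Phase error: plain difference of the principal phase values (no unwrapping).
    phase_err = np.sum((np.angle(fa) - np.angle(fb)) ** 2) / n2
    weighted = lam * phase_err + (1.0 - lam) * mag_err
    return float(mag_err), float(phase_err), float(weighted)
```

Scaling an image by a constant leaves the phase spectrum untouched, so only the magnitude term responds; for multiband images the per-band values would be summed as described above.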
Due to the localized nature of distortion and/or the non-stationary image field, Minkowski
averaging of block spectral distortions may be more advantageous. Thus an image can be
divided into L non-overlapping or overlapping blocks of size b × b, say 16×16, and blockwise
spectral distortions as in (A14)-(A15) can be computed. Let the DFT of the l-th block of the k-th
band image $C_k^l(m,n)$ be $\Gamma_k^l(u,v)$:
$$\Gamma_k^l(u,v) = \sum_{m,n=0}^{b-1} C_k^l(m,n) \exp\left[ -2\pi i m \frac{u}{b} \right] \exp\left[ -2\pi i n \frac{v}{b} \right]$$
where $u, v = -\frac{b}{2} \ldots \frac{b}{2}$ and $l = 1, \ldots, L$, or in the magnitude-phase form
$$\Gamma_k^l(u,v) = \left| \Gamma_k^l(u,v) \right| e^{j \phi_k^l(u,v)} = M_k^l(u,v)\, e^{j \phi_k^l(u,v)} .$$
Then the following measures can be defined in the transform domain over the l-th block:
$$J_M^l = \frac{1}{K} \sum_{k=1}^{K} \left( \sum_{u,v=0}^{b-1} \left| \left| \Gamma_k^l(u,v) \right| - \left| \hat{\Gamma}_k^l(u,v) \right| \right|^{\gamma} \right)^{1/\gamma}$$
$$J_{\varphi}^l = \frac{1}{K} \sum_{k=1}^{K} \left( \sum_{u,v=0}^{b-1} \left| \phi_k^l(u,v) - \hat{\phi}_k^l(u,v) \right|^{\gamma} \right)^{1/\gamma}$$
$$J^l = \lambda J_M^l + (1 - \lambda) J_{\varphi}^l$$
with λ the relative weighting factor of the magnitude and phase spectra. Obviously
measures A.16-A.18 are special cases of the above definitions for block size b covering the
whole image. Various rank order operations on the block spectral differences $J_M$ and / or $J_{\varphi}$
can prove useful. Thus let $J^{(1)}, \ldots, J^{(L)}$ be the rank ordered block distortions, such that for
example $J^{(L)} = \max_l \{ J^l \}$. Then one can consider the following rank order averages: Median
block distortion, $\frac{1}{2} \left( J^{(L/2)} + J^{(L/2+1)} \right)$; Maximum block distortion, $J^{(L)}$; and Average block
distortion, $\frac{1}{L} \sum_{i=1}^{L} J^{(i)}$. We have found that the median of the block distortions is the most effective
averaging of rank ordered block spectral distortions and we have thus used:
$$S3 = \operatorname{median}_{l}\, J_M^l \qquad \text{(A.16)}$$
$$S4 = \operatorname{median}_{l}\, J_{\varphi}^l \qquad \text{(A.17)}$$
$$S5 = \operatorname{median}_{l}\, J^l \qquad \text{(A.18)}$$
In this study we have averaged the block spectra with γ =2 and as for the choice of the block
size we have found that block sizes of 32 and 64 yield better results than sizes in the lower or
higher ranges.
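A sketch of the block spectral measures S3 to S5 with non-overlapping blocks is given below. The function name and the default λ = 0.5 are assumptions of this sketch (the text does not state the weight used for the block version), and phase differences are again taken without unwrapping.

```python
import numpy as np

def block_spectral_measures(c, c_hat, b=32, gamma=2.0, lam=0.5):
    """Return (S3, S4, S5): medians over blocks of J_M, J_phi and J.
    c, c_hat: arrays of shape (N, N, K); b: block size."""
    c = c.astype(np.float64)
    c_hat = c_hat.astype(np.float64)
    n = c.shape[0]
    j_mag, j_phase = [], []
    for i in range(0, n - b + 1, b):              # non-overlapping blocks
        for j in range(0, n - b + 1, b):
            fa = np.fft.fft2(c[i:i + b, j:j + b, :], axes=(0, 1))
            fb = np.fft.fft2(c_hat[i:i + b, j:j + b, :], axes=(0, 1))
            dm = np.abs(np.abs(fa) - np.abs(fb)) ** gamma
            dp = np.abs(np.angle(fa) - np.angle(fb)) ** gamma
            # Minkowski sum over frequencies per band, then mean over the K bands.
            j_mag.append(np.mean(np.sum(dm, axis=(0, 1)) ** (1.0 / gamma)))
            j_phase.append(np.mean(np.sum(dp, axis=(0, 1)) ** (1.0 / gamma)))
    j_mag = np.array(j_mag)
    j_phase = np.array(j_phase)
    j_all = lam * j_mag + (1.0 - lam) * j_phase
    return float(np.median(j_mag)), float(np.median(j_phase)), float(np.median(j_all))
```

Taking the median rather than the mean keeps a few badly distorted blocks from dominating the score, which matches the rank-order reasoning above.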
Context Measures
Most of the compression algorithms and computer vision tasks are based on the
neighborhood information of the pixels. In this sense any loss of information in the pixel
neighborhoods, that is, damage to pixel contexts could be a good measure of overall image
distortion. Since such statistical information lies in the context probabilities, that is the joint
probability mass function (p.m.f.) of pixel neighborhoods, changes in the context
probabilities should be indicative of image distortions.
A major hurdle in the computation of context distortions is the requirement to calculate the
high dimensional joint probability mass function. Typical p.m.f. dimensions would be of the
order of s = 10 at least. Consequently one incurs the “curse of dimensionality” problem.
However, as detailed in [50], [51], this problem can be solved by judicious usage of kernel
estimation and cluster analysis. A modification of the kernel method is to identify the
important regions in an s-dimensional space $X^s$ by cluster analysis and to fit region-specific
kernels to these locations. The result is a model that represents well both the mode and tail
regions of p.m.f.'s, while combining the summarizing strength of histograms with the
generalizing strength of kernel estimates.
In what follows we have used the causal neighborhood of pixels i.e., Ck (i, j ) , Ck (i − 1, j ) ,
Ck (i, j − 1) , Ck (i − 1, j − 1) , k=1, 2, 3. Hence we have derived s = 12 dimensional p.m.f’s
obtained from 4-pixel neighborhoods in the 3-bands.
If instead of the true probability, a perturbed version $\hat{p}$, that is the p.m.f. of the distorted
image, is used, then the average bit rate $R(\hat{p})$ becomes
$$R(\hat{p}) = - \sum_{\mathbf{X} \in X^s} p(\mathbf{X}) \log_2 \hat{p}(\mathbf{X}) = H(\mathbf{X}) + D(p \,\|\, \hat{p}) .$$
The increase in the entropy rate is also indicative of how much the context probability differs
from the original due to coding artifacts. However we do not know the true p.m.f. p, hence
its rate. We can bypass this problem by comparing two competing compression algorithms
in terms of the resulting context probabilities $\hat{p}_1$ and $\hat{p}_2$. If $\hat{p}_1$ and $\hat{p}_2$ are the p.m.f.'s
resulting from the two compressed images, then their difference in relative entropy
$$Z1 = D(p \,\|\, \hat{p}_1) - D(p \,\|\, \hat{p}_2) = R(\hat{p}_1) - R(\hat{p}_2) \qquad \text{(A.19)}$$
is easily and reliably estimated from a moderate-size sample by subtracting the sample
average of $-\log \hat{p}_2$ from that of $-\log \hat{p}_1$ [51]. The comparison can be carried out for more
than two images compressed to different bit rates in a similar way, that is, by comparing them
two by two, since the unknown entropy term is common to all of them.
As a quality measure for images we have calculated Z1 for each image when they were
compressed at two consecutive bit rates, for example, $R(\hat{p}_1)$ at the bit rate of quality factor
90 and $R(\hat{p}_2)$ at the bit rate of quality factor 70, etc. Alternatively the distortion was
calculated for an original image and its blurred or noise contaminated version.
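The identity $R(\hat{p}) = H + D(p \| \hat{p})$ and the sample-based estimate of Z1 can be checked on a small discrete example. The sketch below works on one-dimensional p.m.f. arrays rather than on the 12-dimensional context p.m.f.'s of the paper, and all function names are hypothetical.

```python
import numpy as np

def entropy_bits(p):
    """H(p) in bits for a discrete p.m.f. given as a 1-D array."""
    p = np.asarray(p, dtype=np.float64)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(p[nz])))

def cross_entropy_bits(p, q):
    """R(q) = -sum p log2 q: average bit rate when q is used in place of p."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log2(q[nz])))

def kl_bits(p, q):
    """Relative entropy D(p || q) in bits."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    nz = p > 0
    return float(np.sum(p[nz] * np.log2(p[nz] / q[nz])))

def z1_from_samples(samples, q1, q2):
    """Estimate Z1 from symbol indices drawn from the (unknown) true p.m.f.:
    sample average of -log2 q1 minus sample average of -log2 q2."""
    samples = np.asarray(samples)
    q1 = np.asarray(q1, dtype=np.float64)
    q2 = np.asarray(q2, dtype=np.float64)
    return float(np.mean(-np.log2(q1[samples])) - np.mean(-np.log2(q2[samples])))
```

Because the entropy term H cancels in the difference, `z1_from_samples` never needs the true p.m.f. itself, only samples drawn from it.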
E.2. f-divergences
Once the joint p.m.f. of a pixel context is obtained, several information theoretic distortion
measures [53] can be used. Most of these measures can be expressed in the following general
form
$$d(p, \hat{p}) = g\left( E_p\left[ f\left( \frac{\hat{p}}{p} \right) \right] \right)$$
where $\frac{\hat{p}}{p}$ is the likelihood ratio between $\hat{p}$, the context p.m.f. of the distorted image, and p, the
p.m.f. of the original image, and $E_p$ is the expectation with respect to p. Some
examples are as follows:
Hellinger Distance: $f(x) = \left( \sqrt{x} - 1 \right)^2$, $g(x) = \frac{1}{2} x$
$$Z2 = \frac{1}{2} \int \left( \sqrt{\hat{p}} - \sqrt{p} \right)^2 d\lambda \qquad \text{(A.20)}$$
Generalized Matusita Distance: $f(x) = \left| 1 - x^{1/r} \right|^r$, $g(x) = x^{1/r}$
$$Z3 = \left( \int \left| p^{1/r} - \hat{p}^{1/r} \right|^r d\lambda \right)^{1/r} , \qquad r \geq 1 \qquad \text{(A.21)}$$
Notice that the integrations in (A.20)-(A.21) are carried out in s-dimensional space. Also we have
found according to ANOVA analysis that the choice of r = 5 in the Matusita distance yields
good results. Despite the fact that the p.m.f.'s do not reflect directly the structural content or
the geometrical features in an image, they perform sufficiently well to differentiate artifacts
between the original and test images.
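For discrete p.m.f.'s the integrals in (A.20) and (A.21) reduce to sums over the bins. The following sketch assumes both p.m.f.'s are given as arrays over the same bins; the function names are hypothetical.

```python
import numpy as np

def hellinger_distance(p, p_hat):
    """Z2: half the summed squared difference of the square-rooted p.m.f.'s."""
    p = np.asarray(p, dtype=np.float64)
    p_hat = np.asarray(p_hat, dtype=np.float64)
    return float(0.5 * np.sum((np.sqrt(p_hat) - np.sqrt(p)) ** 2))

def matusita_distance(p, p_hat, r=5):
    """Z3: generalized Matusita distance with exponent r >= 1."""
    p = np.asarray(p, dtype=np.float64)
    p_hat = np.asarray(p_hat, dtype=np.float64)
    inner = np.sum(np.abs(p ** (1.0 / r) - p_hat ** (1.0 / r)) ** r)
    return float(inner ** (1.0 / r))
```

For r = 2 the two measures are related by $Z3^2 = 2\, Z2$, which gives a quick consistency check between the definitions.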
block number u and of the k'th spectral component. However the KS distance did not turn out
to be effective in the ANOVA tests. Instead the SRC measure had a better performance. We
again considered the SRC on a 16x16 block basis and we took the maximum over the three
spectral bands. The block SRC measure was computed by computing the rank scores of the
“gray” levels in the bands and their largest for each pixel neighborhood. Finally the
correlation of the block ranks of the original and distorted images is calculated:
$$Z4 = \sum_{u=1}^{b} \max_{k=1..K} \left\{ SRC_u^k \right\} \qquad \text{(A.22)}$$
where $SRC_u^k$ denotes the Spearman Rank Correlation for the u'th block number and k'th
spectral band.
Despite the quest for an objective image distortion measure, it is intriguing to find out the role of
HVS-based measures. The HVS is too complex to be fully understood with present
psychophysical means, but the incorporation of even a simplified HVS model into objective
measures reportedly [7], [54], [10], [14] leads to a better correlation with the subjective
ratings. It is conjectured therefore that in machine vision tasks they may have some
relevance as well.
In order to obtain a closer relation with the assessment by the human visual system, both the
original and coded images can be preprocessed via filters that simulate the HVS. One of the
models for the human visual system is given as a band-pass filter with a transfer function in
polar coordinates [54]:
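$$H(\rho) = 0.05\, e^{\rho^{0.554}} , \qquad \rho < 7$$
$$H1 = \frac{1}{K} \sum_{k=1}^{K} \frac{ \sum_{i,j=0}^{N-1} \left| U\{C_k(i,j)\} - U\{\hat{C}_k(i,j)\} \right| }{ \sum_{i,j=0}^{N-1} \left| U\{C_k(i,j)\} \right| } \qquad \text{(A23)}$$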
L2 Norm:
$$H2 = \frac{1}{K} \sum_{k=1}^{K} \left\{ \frac{1}{N^2} \sum_{i,j=0}^{N-1} \left[ U\{C_k(i,j)\} - U\{\hat{C}_k(i,j)\} \right]^2 \right\}^{1/2} . \qquad \text{(A24)}$$
$$H3 = \sum_{i=1}^{102} \omega_i d_i \qquad \text{(A25)}$$
where $\omega_i$ are their weights as attributed in [55] and $d_i$ are the individual feature
discrepancies. We call this metric the “browsing metric” for the lack of a better name. For
example the color contrast distortion at scale l is given by
$$d_{\mu} = \frac{1}{N_l N_l} \sum_{i,j=0}^{N_l} \left( K(i,j) - \hat{K}(i,j) \right)^2$$
where $N_l \times N_l$ is the size of the image at scale l. $K(i,j)$ and $\hat{K}(i,j)$ denote any color or
contrast channel of the original image and of the coded image at a certain level l. The
lengthy details of the algorithm and its adaptation to our problem are summarized in [14],
[55]. Finally note that this measure was used only for color images, and not in the case of
satellite three-band images.
The last quality measure we used that reflects the properties of the human visual system was
the DCTune algorithm [56]. DCTune is in fact a technique for optimizing JPEG still image
compression. DCTune calculates the best JPEG quantization matrices to achieve the
maximum possible compression for a specified perceptual error, given a particular image and
a particular set of viewing conditions. DCTune also allows the user to compute the perceptual
error between two images in units of jnd (just-noticeable differences) between a reference
image and a test image (http://vision.arc.nasa.gov/dctune/dctune2.0.html). This figure was
used as the last metric (H4) in Table 1.
LIST OF FIGURES
Figure 1. Box plots of quality measure scores. a) Good measure, b) Moderate measure,
c) Poor measure.
Figure 2. SOM map of distortion measures for JPEG and SPIHT.
Figure 3. The plot of Mean Opinion Score (MOS) and image quality supermetric data.
LIST OF TABLES
Table 2: ANOVA results (F-scores) for the JPEG and SPIHT compression distortions as well
as additive noise and blur artifacts. For each distortion type the variation due to image set is
also established.
Table 3: ANOVA results (F-scores and Interaction) for JPEG , SPIHT, Blur and Noise for
Fabrics Images for All Image Classes
Table 4: ANOVA results for the effect of bit rate (pooled data from JPEG and SPIHT), and
of the coder type.
Table 5: Image quality metrics and their correlation coefficients with MOS data.
Acknowledgements: We would like to thank H. Brettel (ENST, France), A. Eskicioglu
(Thompson Comm., USA) as well as an anonymous reviewer for their invaluable help in
improving the paper.
a) An example of good measure: the H2 measure with JPEG compression achieving F=2291 score.
b) An example of mediocre measure: the D1 measure with JPEG compression achieving F=104.6 score.
Figure 1. Box plots of quality measure scores. a) Good measure, b) Moderate measure, c)
Poor measure. The F-scores are also given.
Figure 2. SOM of distortion measures for JPEG and SPIHT.
Figure 3. The plot of Mean Opinion Score (MOS) and image quality supermetric data against
bit rate.
Table 1: List of symbols and equation numbers of the quality metrics.
SYMBOL DESCRIPTION EQUATION
D1 Mean Square Error A.1
D2 Mean Absolute Error A.2
D3 Modified Infinity Norm A.3
D4 L*a*b* Perceptual Error A.4
D5 Neighborhood Error A.5
D6 Multiresolution Error A.6
C1 Normalized Cross-Correlation A.7
C2 Image Fidelity A.8
C3 Czekanowski Correlation A.9
C4 Mean Angle Similarity A.10
C5 Mean Angle-Magnitude Similarity A.11
E1 Pratt Edge Measure A.12
E2 Edge Stability Measure A.13
S1 Spectral Phase Error A.14
S2 Spectral Phase-Magnitude Error A.15
S3 Block Spectral Magnitude Error A.16
S4 Block Spectral Phase Error A.17
S5 Block Spectral Phase-Magnitude Error A.18
Z1 Rate Distortion Measure A.19
Z2 Hellinger distance A.20
Z3 Generalized Matusita distance A.21
Z4 Spearman Rank Correlation A.22
H1 HVS Absolute Norm A.23
H2 HVS L2 Norm A.24
H3 Browsing Similarity A.25
H4 DCTune
Table 2: ANOVA results (F-scores) for the JPEG and SPIHT compression distortions as well as
additive noise and blur artifacts. For each distortion type the variation due to image set is also
established. For compression the degrees of freedom are 4 (bit rate) and 2 (image class) while they
are both 2 for the blur and noise experiments.
Table 4. ANOVA results for the effect of bit rate (pooled data from JPEG and SPIHT), and
of the coder type. The degrees of freedom are 4 (bit rate) and 1 (coder type).
ANOVA2 JPEG+SPIHT
Metric Bit rate Coder
D1 89.79 0.75
D2 74.98 2.72
D3 71.55 1.21
D4 70.52 43.85
D5 17.07 0.0005
D6 85.22 118.8
C1 2.66 45.47
C2 12.28 18.27
C3 56.48 1.56
C4 31.3 2.43
C5 78.98 2.23
E1 42.69 11.61
E2 122.4 26.28
S1 99.12 5.29
S2 140.1 12.37
S3 92.99 9.27
S4 115.5 39.1
S5 124.8 43.09
Z1 4.28 41.6
Z2 9.54 0.83
Z3 12.87 0.56
Z4 9.39 6.64
H1 278.6 52.87
H2 493 87.21
H3 97.94 16.19
H4 21.13 57.72
References
1. S.M. Perlmutter, P.C Cosman, R.M. Gray, R.A. Olshen, D.Ikeda, C.N. Adams, B.J. Betts,
M.B. Williams, K.O. Perlmutter, J. Li, A. Aiyer, L. Fajardo, R. Birdwell, B.L. Daniel,
“Image Quality in Lossy Compressed Digital Mammograms”, Signal Processing, 59,
189-210 (1997).
2. C. B. Lambrecht, Ed., “Special Issue on Image and Video Quality Metrics”, Signal
Processing, vol. 70, (1998).
3. T. Lehmann, A. Sovakar, W. Schmitt, R. Repges, “A comparison of Similarity Measures
for Digital Subtraction Radiography”, Comput. Biol. Med., 27(2), 151-167 (1997).
4. A. M. Eskicioğlu, “Application of Multidimensional Quality Measures to Reconstructed
Medical Images”, Opt. Eng. 35(3) 778-785 (1996).
5. A. M. Eskicioğlu, P. S. Fisher, “Image Quality Measures and Their Performance”, IEEE
Trans. Commun., 43(12), 2959-2965 (1995).
6. H. de Ridder, “Minkowsky Metrics as a Combination Rule for Digital Image Coding
Impairments”, in Human Vision, Visual Processing, and Digital Display III, Proc. SPIE
1666, 17-27 (1992).
7. A. B. Watson, Ed., Digital Images and Human Vision, Cambridge, MA, MIT Press
(1993).
8. B. Girod, “What’s Wrong with Mean-squared Error”, in Digital Images and Human
Vision, A. B. Watson, ed., Chapter 15, Cambridge, MA, MIT Press (1993).
9. M. Miyahara, K. Kotani, V. R. Algazi, “Objective Picture Quality Scale (PQS) for Image
Coding”, IEEE Trans. Commun., 46(9), 1213-1226 (1998).
10. N. B. Nill, B. H. Bouzas, “Objective Image Quality Measure Derived From Digital Image
Power Spectra”, Opt. Eng., 31(4), 813-825, (1992).
11. P. Franti, “Blockwise Distortion Measure for Statistical and Structural Errors in Digital
Images” Signal Processing: Image Communication, 13, 89-98 (1998).
12. S. Winkler: "A perceptual distortion metric for digital color images." in Proc. 5th
International Conference on Image Processing, vol. 3, pp. 399-403, Chicago, Illinois,
October 4-7, 1998.
13. S. Daly, “The visible differences predictor: An algorithm for the assessment of image
fidelity”, in Digital Images and Human Vision, A. B. Watson, ed., Cambridge, MA, MIT
Press, 179-205 (1993).
14. T. Frese, C. A. Bouman and J. P. Allebach, “Methodology for Designing Image
Similarity Metrics Based on Human Visual System Models”, Proceedings of SPIE/IS&T
Conference on Human Vision and Electronic Imaging II, Vol. 3016, San Jose, CA, 472-
483 (1997).
15. CCIR, “Rec. 500-2 Method for the Subjective Assessment of the Quality of Television
Pictures”, (1986).
16. M. Van Dijk, J. B. Martens, “Subjective Quality Assessment of Compressed Images”,
Signal Processing, 58, 235-252 (1997).
17. A.M. Rohaly, P. Corriveau, J. Libert, A. Webster, V. Baroncini, J. Beerends, J.L Blin, L.
Contin, T. Hamada, D. Harrison, A. Hekstra, J. Lubin, Y. Nishida, R. Nishihara, J.
Pearson, A. F. Pessoa, N. Pickford, A. Schertz, M. Visca, A. B. Watson, S. Winkler:
"Video Quality Experts Group: Current results and future directions." Proc. SPIE Visual
Communications and Image Processing, vol. 4067, Perth, Australia, June 21-23, 2000.
18. P. Corriveau, A. Webster, "VQEG Evaluation of Objective Methods of Video Quality
Assessment", SMPTE Journal, 108, 645-648, 1999.
19. T. Kanungo, R. M. Haralick, “A Methodology for Quantitative Performance Evaluation of
Detection Algorithms”, IEEE Trans. Image Process., 4(12), 1667-1673, (1995).
20. R. Matrik, M. Petrou, J. Kittler, “Error-Sensitivity Assessment of Vision Algorithms”,
IEE Proc.-Vis. Image Signal Processing, 145(2), 124-130 (1998).
21. M. Grim, H. Szu, “Video Compression Quality Metrics Correlation with Aided Target
Recognition (ATR) Applications”, J. of Electronic Imaging, 7(4), 740-745, (1998).
22. H. H. Barrett, “Objective Assessment of Image Quality: Effects of Quantum Noise and
Object Variability”, J. Opt. Soc. Am., A(7), 1261-1278 (1990).
23. H. H. Barrett, J. L. Denny, R. F. Wagner, K. J. Myers, “Objective Assessment of Image
Quality II: Fisher Information, Fourier-Crosstalk, and Figures of Merit for Task
Performance”, J. Opt. Soc. Am., A(12), 834-852, (1995).
24. C.E. Halford, K.A. Krapels, R.G. Driggers, E.E. Burroughs, Developing Operational
Performance Metrics Using Image Comparison Metrics and the Concept of Degradation
Space, Optical Engineering, 38 (5), 836-844, 1999.
25. G. K. Wallace, “The JPEG Still Picture Compression Standard”, IEEE Trans. Consumer
Electron., 38(1), 18-34 (1992).
26. A. Said, W. A. Pearlman, “A New Fast and Efficient Image Codec Based on Set
Partitioning in Hierarchical Trees”, IEEE Trans. Circuits and Syst. Video Technol., 6(3),
243-250 (1996).
27. A. M. Martinez, R. Benavente, “The AR Face Database”, CVC Technical Report No. 24,
June 1998.
28. A. C. Rencher, Methods of Multivariate Analysis, New York, John Wiley (1995).
29. T. Kohonen, Self-Organizing Maps. Springer-Verlag, Heidelberg, (1995).
30. A. W. Lohmann, D. Mendelovic, G. Shabtay, “Significance of Phase and Amplitude in
the Fourier Domain”, J. Opt. Soc. of Am., 14, 2901 - 2904 (1997).
31. M. P. Dubuisson, A. K. Jain, “A Modified Hausdorff Distance for Object Matching”,
Inter. Conf. on Pattern Recognition, A: 566-569, Jerusalem (1994).
32. International Commission of Illumination (CIE), Recommendations on Uniform Color
Spaces, Color Difference Equations, Psychometric Color Terms, Publication CIE 15 (E.-
1.3.1), Supplement No. 2, Bureau Central de la CIE, Vienna, (1971).
33. A. K. Jain, Fundamentals of Digital Image Processing, New Jersey, Prentice Hall (1989).
34. V. DiGesu, V. V. Starovoitov, “Distance-based Functions for Image Comparison”,
Pattern Recognition Letters, 20(2), 207-213 (1999).
35. V. V. Starovoitov, C. Köse, B. Sankur, “Generalized Distance Based Matching of
Nonbinary Images”, International Conference on Image Processing, Chicago, (1998).
36. P. Juffs, E. Beggs, F. Deravi, “A Multiresolution Distance Measure for Images”, IEEE
Signal Processing Letters, 5(6), 138-140 (1998).
37. D. Androutsos, K. N. Plataniotis, A. N. Venetsanopoulos, “Distance Measures for Color
Image Retrieval”, IEEE International Conference On Image Processing, IEEE Signal
Processing Society, IEEE, Chicago, (1998).
38. http://ag.arizona.edu/classes/rnr555/lecnotes/10.html
39. W. K. Pratt, Digital Image Processing, New York, Wiley (1978).
40. P. E. Trahanias, D. Karakos, A. N. Venetsanopoulos, “Directional Processing of Color
Images: Theory and Experimental Results”, IEEE Trans. Image Process., 5(6), 868-880
(1996).
41. C. Zetzsche, E. Barth, B. Wegmann, “The Importance of Intrinsically Two-Dimensional
Image Features in Biological Vision and Picture Coding,” in Digital Images and Human
Vision, A. B. Watson, ed., Cambridge, MA, MIT Press, 109-138 (1993).
42. P. K. Rajan, J. M. Davidson, “Evaluation of Corner Detection Algorithms”, Proc. of
Twenty-First Southeastern Symposium on System Theory, 29-33, 1989.
43. J. Canny, “A Computational Approach to Edge Detection”, IEEE Trans. Pattern. Anal.
Mach. Intell., 8(6), 679-698 (1986).
44. D. Carevic, T. Caelli, “Region Based Coding of Color Images Using KLT”, Graphical
Models and Image Processing 59(1), 27-38 (1997).
45. H. Tao, T. Huang, “Color Image Edge Detection using Cluster Analysis”, IEEE
International Conference On Image Processing, 834-836, IEEE Signal Processing
Society, IEEE, California, (1997).
46. P. E. Trahanias, A. N. Venetsanopoulos, “Vector Order Statistics Operators as Color
Edge Detectors”, IEEE Trans. Syst. Man Cybern., 26(1), 135-143 (1996).
47. M. M. Lipschutz, Theory and Problems of Differential Geometry, McGraw-Hill Inc.,
(1969).
48. M. McIvor, R. J. Valkenburg, “A Comparison of Local Surface Geometry Estimation
Methods”, Machine Vision and Applications, 10, 17-26 (1997).
49. İ. Avcıbaş, B. Sankur, “Statistical Analysis of Image Quality Measures”, 10th European
Signal Processing Conf., EUSIPCO-2000, 2181-2184, Tampere, Finland, 2000.
50. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, New York, Wiley,
(1973).
51. K. Popat, R. Picard, “Cluster Based Probability Model and Its Application to Image and
Texture Processing”, IEEE Trans. Image Process., 6(2), 268-284 (1997).
52. T. M. Cover and J. A. Thomas, Elements of Information Theory, New York, Wiley
(1991).
53. M. Basseville, “Distance Measures for Signal Processing and Pattern Recognition”,
Signal Processing 18, 349-369 (1989).
54. N. B. Nill, “A Visual Model Weighted Cosine Transform for Image Compression and
Quality Assessment”, IEEE Trans. Commun., 33(6) 551-557 (1985).
55. T. Frese, C. A. Bouman, and J. P. Allebach, “A Methodology for Designing Image
Similarity Metrics Based on Human Visual System Models,” Tech. Rep. TR-ECE 97-2,
Purdue University, West Lafayette, IN, (1997).
56. A. B. Watson, “DCTune: A Technique for Visual Optimization of DCT Quantization
Matrices for Individual Images”, Society for Information Display Digest of Technical
Papers, XXIV, 946-949, 1993.
57. İ. Avcıbaş, N. Memon and B. Sankur, “Steganalysis of Watermarking Techniques Using
Image Quality Metrics”, SPIE Conference 4314: Security and Watermarking of
Multimedia Contents III, San Jose, USA, 2001.
58. İ. Avcıbaş, N. Memon and B. Sankur, “Steganalysis Based On Image Quality Metrics”,
to be presented, MMSP’2001: IEEE Workshop on Multimedia Signal Processing, Cannes,
France, 2001.
59. İ. Avcıbaş, N. Memon and B. Sankur, “Steganalysis Using Image Quality Metrics”,
(under review), IEEE Transactions on Image Processing, 2001.