TR 1
Abstract
This report reviews existing literature describing forecast accuracy metrics, concentrating on
those based on relative errors and percentage errors. We then review how the most common of
these metrics, the mean absolute percentage error (MAPE), has been applied in recent radiation
belt modeling literature. Finally, we describe metrics based on the ratios of predicted to observed
values (the accuracy ratio) that address the drawbacks inherent in using MAPE. Specifically
we define and recommend the median log accuracy ratio as a measure of bias and the median
symmetric accuracy as a measure of accuracy.
1 Introduction
The utility, or value, of any forecast model is determined by how well the forecast predicts the
quantities being modeled. There exists, however, a wide range of metrics to assess forecast quality
and a similarly wide range of views on just what a “good” forecast is. One key measure of the
quality of a forecast is in how much it deviates from the observation and that is what will be
discussed in this report. We begin by briefly introducing some quantitative attributes of forecast
performance, followed by definitions of metrics to quantitatively describe these attributes in the
case of continuous predictands.
Although a forecast is strictly a prediction of events that have not yet occurred, this report
treats simulation results as a forecast, regardless of the time interval.
The simplest measure of bias is the mean error (ME),
\[
\mathrm{ME} = \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i
\]
where n is the number of observation-forecast pairs, εi = yi − xi is the difference between the predicted
value yi and the observed value xi, and the subscript i denotes the ith element of the series. In our
previous example the forecast values were consistent underestimates of the observed value. Forecasts
that, on average, over- or under-estimate the observed value display bias. Calculating the ME for the
example above we have ((−2) + (−4))/2 = −3 nT. A negative value indicates a systematic
under-prediction, whereas a positive value indicates a systematic over-prediction.
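As an illustrative sketch (not part of the original report), the mean error can be computed as follows; the function name and example values are ours, and the example reproduces the −3 nT result above.

    import numpy as np

    def mean_error(predicted, observed):
        # ME: average of (predicted - observed); negative values indicate under-prediction
        predicted = np.asarray(predicted, dtype=float)
        observed = np.asarray(observed, dtype=float)
        return np.mean(predicted - observed)

    # Forecasts that underestimate the observations by 2 and 4 nT
    print(mean_error([98.0, 96.0], [100.0, 100.0]))  # -3.0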
For data that have different scales, scale-independent accuracy measures are often recommended.
Although the variability in electron fluxes at a given location and energy can be large, scale-
dependent measures would still be appropriate. However, there can be several orders of magnitude
difference between electron fluxes at L ≃ 4 and geosynchronous orbit, with each location displaying
different levels of variability. Thus comparing scale-dependent accuracy measures can be problem-
atic. Similarly, the measurements across a single orbit of a satellite in a highly-elliptical orbit cover
regions that could be argued to be of different scale and dynamics.
One approach to giving more equal weight to errors across several orders of magnitude is to use
metrics that are based on relative errors (including percentage errors) or are otherwise scaled to
normalize the errors. Alternatively, the data themselves can be transformed through the application
of a power function, such as taking logarithms or applying a Box-Cox transform [Wilks, 2006].
By transforming the data this way, the use of scale-dependent accuracy measures may be better
justified, as well as application of methods that assume homoscedasticity [Sheskin, 2007]. We note,
however, that transforming the data alters the scale and may invalidate the assumptions behind
other analyses. We will first introduce scale-dependent metrics, followed by two classes of scale-
independent metrics. We will then focus more closely on the mean absolute percentage error and
its use in recent literature, before proposing a new accuracy measure based on relative errors.
It can be seen that the mean squared error is analogous to the variance, and penalizes large errors
more heavily than small errors. When fitting a regression model, the use of ordinary least squares (OLS)
implicitly minimizes the mean squared error.
As we are concerned with estimating the accuracy of a forecast, which will likely not be derived
from an OLS regression model, the decision of which error metric should be used depends on the
relative cost of different errors. Two pertinent questions here are:
1. If the error doubles, is this twice as bad? Or is it more than twice as bad?
2. Is an overestimate worse than an underestimate of the same magnitude?
The two questions can be equivalently phrased as “What is the form of the cost function (e.g.,
linear or quadratic)?”, and “Is the cost function symmetric?” If we wish to reduce the penalty on
large errors – i.e., use a linear loss function, rather than a quadratic loss function – we can use the
mean absolute error (MAE). This is defined as
\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left|\varepsilon_i\right| \tag{5}
\]
This metric is more resistant to outliers, as it uses |ε| rather than ε². It may, therefore, be more
appropriate in cases where the errors are not normally distributed or where large forecast errors
are not required to be weighted more heavily.
Both the MSE and MAE estimate the location (central tendency) of the error distribution using
the mean. As the mean is not a robust measure, we can improve the robustness of our accuracy
metric by using a common robust measure of location: the median. Replacing the mean function
in equation 5 with the median function (M) gives us the median absolute error (MdAE).
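A minimal sketch of the MAE and MdAE, assuming predicted and observed values are supplied as arrays (the function names are ours):

    import numpy as np

    def mean_absolute_error(predicted, observed):
        # MAE (equation 5): mean of the absolute errors
        err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
        return np.mean(np.abs(err))

    def median_absolute_error(predicted, observed):
        # MdAE: median of the absolute errors, more robust to outliers
        err = np.asarray(predicted, dtype=float) - np.asarray(observed, dtype=float)
        return np.median(np.abs(err))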
A good summary of unscaled measures of accuracy and bias can be found in Walther and Moore
[2005]. We note here that unscaled metrics imply that deviations of the same magnitude have
equal importance at different magnitudes of the base quantity. For example, an error of ε = 100 is
penalized equally at x = 10³ and x = 10⁶.
We note here that relative and percentage error metrics imply that deviations of the same order
have equal importance at different magnitudes of the base quantity. For example, an error of
ε = 100 where x = 10³ has an equal penalty to an error ε = 1 where x = 10 – both give a relative
error of 0.1, and thus a percentage error of 10%.
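A brief sketch of the mean absolute percentage error, written by us using the standard percentage error 100|yi − xi|/xi (consistent with the examples in this report); it verifies that an error of 100 at x = 10³ and an error of 1 at x = 10 are penalized equally:

    import numpy as np

    def mape(predicted, observed):
        # Mean absolute percentage error: mean of 100*|y - x| / x over all pairs
        predicted = np.asarray(predicted, dtype=float)
        observed = np.asarray(observed, dtype=float)
        return 100.0 * np.mean(np.abs(predicted - observed) / np.abs(observed))

    print(mape([1100.0], [1000.0]))  # 10.0
    print(mape([11.0], [10.0]))      # 10.0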
MAPE is used across many different fields of research, from population research [e.g. Swanson
et al., 2000] to business forecasting [e.g. Kohzadi et al., 1996] and atmospheric science [e.g. Grillakis
et al., 2013; Zheng and Rosenfeld , 2015]. MAPE has also been used in validation of radiation belt
models [Kim et al., 2012; Tu et al., 2013; Li et al., 2014], and these are discussed further in section
2. However, MAPE is not without problems that may be pertinent for radiation belt forecasts.
The following problems have been noted by various authors:
1. MAPE becomes undefined when the true value is zero [Hyndman and Koehler, 2006].
2. MAPE is asymmetric with respect to over- and under-forecasting [Makridakis, 1993; Hyndman and Koehler, 2006; Tofallis, 2015].
We note that MAPE is not an appropriate metric where the quantity being modeled can be
zero. Indeed, Tofallis [2015] notes that APE “is generally only used when the quantity of interest is
strictly positive”. We also note that unless the data used are ratio-level data [Sheskin, 2007], the
APE has limited meaning [Hyndman and Athanasopoulos, 2014]. For example, radiation belt fluxes
are constrained to be positive and the units of flux have a true zero (which, practically speaking,
is unlikely to be encountered), therefore APE can be used for radiation belt flux predictions.
To elaborate on the second point, a prediction of 1000 where the observed value is 500 gives a
different magnitude of error (100%) than a prediction of 500 where the observed value is 1000 (50%).
Under-prediction is therefore less heavily penalized than over-prediction, even if the magnitude of
the error is the same.
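The asymmetry can be checked directly (a worked verification of the numbers above, using the percentage error definition assumed earlier):

    # Over-prediction: predict 1000, observe 500
    print(100 * abs(1000 - 500) / 500)    # 100.0
    # Under-prediction: predict 500, observe 1000
    print(100 * abs(500 - 1000) / 1000)   # 50.0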
Given that APEs have a lower bound of zero but no upper bound, they are likely to be positively
skewed. Take a case where the forecast errors are distributed approximately normally, and
where the error is scaled based on the error expected from a persistence forecast. The benefit of
this scaling is that a scaled error is below 1 if the forecast outperforms the average error from
the persistence forecast. This method was developed for time series data, but is not appropriate
where the observing location changes with time (such as flux data from a satellite in a highly
elliptical orbit). In this case, the appropriate forecast to scale by might be “orbital persistence”:
assuming the orbital characteristics change slowly, on successive orbits the satellite will return
to approximately the same location, and the value at that location is taken instead of xi−1. A
further modification of the scaled error was given by Hyndman and Athanasopoulos [2014], where
the scaling is relative to the error of a mean forecast. In the case of a satellite covering a wide
range of locations the mean forecast is also inappropriate. However, the error relative to a
climatological mean for the current location of the satellite provides a meaningful scaling:
\[
q_i = \frac{x_i - y_i}{\frac{1}{n-1}\sum_{i=2}^{n} \left|x_i - c_i\right|} \tag{12}
\]
where ci is the climatological prediction for the location of measurement xi . The mean absolute
scaled error (MASE) is then given by
\[
\mathrm{MASE} = \frac{1}{n}\sum_{i=1}^{n} \left|q_i\right| \tag{13}
\]
It can be seen that this is essentially the mean absolute error of the model, normalized by the mean
absolute error of a benchmark (here a climatological model). When computing MASE for a number
of models, we note that if the climatological model is itself used as the forecast then a MASE of
exactly 1 will result. The scaling is also the same for all models being compared, as the data and
the climatological values do not change; only the forecast values change with each model. Though
MASE makes comparison of models intuitive, its values are difficult to interpret as a magnitude of
error: the ratio of a model's mean absolute error to the mean absolute error of a benchmark
gives little direct information about the size of the errors in the specific model. For this reason,
MAPE remains very popular and we will discuss its use in radiation belt studies before proposing
an alternative that addresses some of its drawbacks.
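A sketch of MASE with a climatological benchmark is given below. The function name is ours, and for simplicity the benchmark error is averaged over all n pairs rather than with the 1/(n − 1) normalization of equation 12; this is an assumption, not the report's exact prescription.

    import numpy as np

    def mase(predicted, observed, climatology):
        # Scaled error (cf. equations 12-13): model MAE divided by the MAE of the
        # climatological benchmark evaluated at the same locations
        predicted = np.asarray(predicted, dtype=float)
        observed = np.asarray(observed, dtype=float)
        climatology = np.asarray(climatology, dtype=float)
        benchmark_mae = np.mean(np.abs(observed - climatology))
        return np.mean(np.abs(predicted - observed)) / benchmark_mae

    # MASE < 1 means the model beats the climatological benchmark on average
    print(mase([1.2e4, 9.0e3], [1.0e4, 1.0e4], [5.0e3, 2.0e4]))  # 0.2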
which gives the intuitive result of a 70% error. When we log-transform the data and repeat this
process we find:
\begin{align}
\mathrm{MAPE} &= 100\,\frac{\log(1.7\times10^{5}) - \log(1\times10^{5})}{\log(1\times10^{5})} \tag{16}\\
&= 100\,\frac{\log\!\left(1.7\times10^{5}/1\times10^{5}\right)}{\log(1\times10^{5})} \tag{17}\\
&\simeq 4.6\% \tag{18}
\end{align}
Note that by taking logarithms we now normalize the log of the flux ratio by the log of the measured
flux, thereby breaking the interpretation of MAPE with respect to the predicted quantity. This
quantity now varies with the magnitude of the observation. If our prediction-observation pair is
(1.7 × 10⁵, 10⁵) or (1.7 × 10², 10²) then calculating MAPE on untransformed data gives 70% in
both cases. Calculating the MAPE using the log of these data gives 4.6% and 11.5%. Put another
way, by log-transforming the data we no longer strictly have ratio-level data, MAPE no longer has
an intuitive meaning, and MAPE is no longer an appropriate metric.
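The numbers quoted above can be checked in a few lines (base-10 logarithms assumed, matching the ≃4.6% and ≃11.5% values):

    import numpy as np

    # MAPE of a single prediction-observation pair, on raw and log10-transformed values
    print(100 * abs(1.7e5 - 1.0e5) / 1.0e5)                                # 70.0
    print(100 * abs(np.log10(1.7e5) - np.log10(1.0e5)) / np.log10(1.0e5))  # ~4.6
    print(100 * abs(np.log10(1.7e2) - np.log10(1.0e2)) / np.log10(1.0e2))  # ~11.5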
To predict the effective dose of galactic cosmic radiation received on trans-polar aviation routes,
Hwang et al. [2015] developed a model that forecasts the heliocentric potential (HCP) from a lagged
time-series of monthly sunspot number. The HCP is a required input for the Federal Aviation
Administration’s CARI-6M software for dose estimation. The modeled HCP presented by Hwang
et al. [2015] shows less variability than the observed HCP, with a tendency for the low values to be
slightly overpredicted and the high values to be significantly underpredicted. Since MAPE more
heavily penalizes the overprediction, it is possible that the accuracy reported by Hwang et al. [2015]
overstates the true accuracy of the model.
Zhelavskaya et al. [2016] have developed a neural network to predict the frequency of the upper-
hybrid resonance to derive electron number densities in the inner magnetosphere, using Van Allen
Probes electric field data. These authors used MAPE to assess the accuracy of their predictions,
both in predicted frequency and predicted number density. We note that the electron number
density, like radiation belt electron flux, is constrained to be positive and has a physically meaningful
zero. Further, the electron number density can vary by orders of magnitude over a single orbit as
well as at a fixed location due to dynamical processes.
this quantity also represents a robust measure of bias, though it suffers from a lack of intuitive
interpretability.
3.1 Accuracy
We here propose a measure of accuracy that uses logarithms of the accuracy ratio, thereby mitigating
many of the problems inherent in using MAPE while maintaining the interpretability of MAPE.
Specifically, we follow the lead of Tofallis [2015] and Morley et al. [2016] in using log(Q), but modify
our accuracy metric such that it is interpretable as a percentage error.
We begin by taking the absolute values of log(Q), then exponentiating to return it to the
original units and scale. This transformation ensures that the metric is symmetric in the sense that
switching the predicted and observed values gives the same error (unlike MAPE).
\[
R = \exp\left(\left|\log(Q)\right|\right) \tag{21}
\]
This can also be seen to be the “matching ratio” of Chen et al. [2007].
¹Kitchenham et al. [2001] present an interesting discussion of the relationship of Q to the mean absolute error and recommend this as a measure of prediction accuracy.
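A sketch of the proposed accuracy measure follows. The percentage form, 100(exp(M(|log Q|)) − 1), is our reading of the statement that the metric should be interpretable as a percentage error together with equation 21; the function name is ours.

    import numpy as np

    def median_symmetric_accuracy(predicted, observed):
        # Median of |log Q| (Q = predicted/observed), exponentiated and expressed
        # as a percentage; symmetric under exchange of prediction and observation
        q = np.asarray(predicted, dtype=float) / np.asarray(observed, dtype=float)
        return 100.0 * (np.exp(np.median(np.abs(np.log(q)))) - 1.0)

    # Unlike MAPE, swapping prediction and observation leaves the value unchanged
    print(median_symmetric_accuracy([1000.0], [500.0]))  # 100.0
    print(median_symmetric_accuracy([500.0], [1000.0]))  # 100.0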
3.2 Bias
As we take the absolute values of log(Q) we lose information about systematic bias in the prediction.
Therefore, by removing the modulus (and the transformation to a percentage scale) we recover a
measure of bias: the median accuracy ratio, βM
\begin{align}
\beta_M &= \exp\left(M\left(\log(Q_i)\right)\right) \tag{23}\\
&= M(Q_i) \tag{24}
\end{align}
As the median is a rank order statistic it is invariant with respect to the log-transform and the
exponentiation, thus we can simplify this to equation 24 [cf. Morley et al., 2016]. If we instead
consider a different location function, this simplification may not hold. If the distribution of log(Q)
is reasonably symmetric then its median will approximate its arithmetic mean, and equation 24 will
approximate the geometric mean of Q. If we substitute the median function with the mean function
(µ), we recover the geometric mean of Q exactly.
We can therefore define a related metric, the geometric mean accuracy ratio, βµ :
\begin{align}
\beta_\mu &= \exp\left(\mu\left(\log(Q_i)\right)\right) \tag{25}\\
&= \left(\prod_{i=1}^{n} Q_i\right)^{1/n} \tag{26}
\end{align}
Either of these measures will give values smaller than 1 for a systematic underprediction, and values
greater than 1 for a systematic overprediction. Although βµ has a clearer mathematical origin, we
prefer βM as it is more resistant to outliers and is more intuitive, due to its use of the median. The
physical meaning of the accuracy ratio is also clear, making the median accuracy ratio an easily
interpretable quantity. It should be noted that βM is not symmetric about 1. If a symmetric measure
of bias is required then the final exponentiation should not be taken, leaving us with the median log
accuracy ratio, as plotted by Morley et al. [2016] and given in equation 20. This final metric has
the benefit that underprediction will give a negative value of M (log(Q)) and over-prediction will
give a positive value; an unbiased forecast will yield M (log(Q)) = 0. This symmetry about zero
then mirrors the more common measure of bias, the mean error. For this reason, we recommend
M (log(Q)) as a measure of bias. The choice of base will determine the level of interpretability for
any given data set.
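The bias measures of this section can be sketched as follows (function names are ours; the base of the logarithm for the median log accuracy ratio is left as a parameter, defaulting to 10 as in the summary below):

    import numpy as np

    def median_accuracy_ratio(predicted, observed):
        # beta_M = M(Q_i), equation 24
        q = np.asarray(predicted, dtype=float) / np.asarray(observed, dtype=float)
        return np.median(q)

    def geometric_mean_accuracy_ratio(predicted, observed):
        # beta_mu = exp(mu(log Q_i)), equations 25-26
        q = np.asarray(predicted, dtype=float) / np.asarray(observed, dtype=float)
        return np.exp(np.mean(np.log(q)))

    def median_log_accuracy_ratio(predicted, observed, base=10.0):
        # MdLQ = M(log Q_i); negative for under-prediction, positive for over-prediction
        q = np.asarray(predicted, dtype=float) / np.asarray(observed, dtype=float)
        return np.median(np.log(q) / np.log(base))

    # A model that over-predicts by a factor of 2 everywhere: beta_M = 2, MdLQ ~ 0.3
    print(median_accuracy_ratio([2.0, 4.0, 8.0], [1.0, 2.0, 4.0]))      # 2.0
    print(median_log_accuracy_ratio([2.0, 4.0, 8.0], [1.0, 2.0, 4.0]))  # ~0.301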
4 Summary
We have introduced a number of commonly-used forecast metrics, including the mean absolute
percentage error (MAPE). A literature review has revealed a number of known problems with the
use of MAPE, particularly for quantities that, like radiation belt fluxes, vary over orders of magnitude.
To address these drawbacks we have described metrics based on the accuracy ratio of predicted to
observed values, and we recommend the median symmetric accuracy as a measure of accuracy.
To indicate bias in a symmetric manner we recommend the median log accuracy ratio (MdLQ)
\[
\mathrm{MdLQ} = M\left(\log\left(\frac{y_i}{x_i}\right)\right)
\]
as used by Morley et al. [2016], as it can be interpreted similarly to the mean error where negative
values indicate a systematic under-prediction and positive values indicate a systematic overprediction.
By using base 10 logarithms, an order of magnitude difference is given by MdLQ = 1, and a
factor of 2 difference is given by MdLQ ≃ 0.3.
References
Chen, Y., R. H. W. Friedel, G. D. Reeves, T. E. Cayton, and R. Christensen (2007), Multisatellite
determination of the relativistic electron phase space density at geosynchronous orbit: An in-
tegrated investigation during geomagnetic storm times. Journal of Geophysical Research: Space
Physics, 112 (A11), A11214, doi:10.1029/2007JA012314.
Grillakis, M. G., A. G. Koutroulis, and I. K. Tsanis (2013), Multisegment statistical bias correction
of daily GCM precipitation output. Journal of Geophysical Research: Atmospheres, 118 (8), 3150–
3162, doi:10.1002/jgrd.50323.
Hwang, J., K. C. Kim, K. Dokgo, E. Choi, and H. P. Kim (2015), Heliocentric potential (HCP)
prediction model for nowcast of aviation radiation dose. J. Astron. Space Sci., 32 (1), 39–44,
doi:10.5140/JASS.2015.32.1.39.
Hyndman, R. J. and A. B. Koehler (2006), Another look at measures of forecast accuracy. International
Journal of Forecasting, 22 (4), 679–688, doi:10.1016/j.ijforecast.2006.03.001.
Kim, K.-C., Y. Shprits, D. Subbotin, and B. Ni (2012), Relativistic radiation belt electron responses
to GEM magnetic storms: Comparison of CRRES observations with 3-D VERB simulations.
Journal of Geophysical Research: Space Physics, 117 (A8), A08221, doi:10.1029/2011JA017460.
Li, Z., M. Hudson, and Y. Chen (2014), Radial diffusion comparing a THEMIS statistical model
with geosynchronous measurements as input. Journal of Geophysical Research: Space Physics,
119 (3), 1863–1873, doi:10.1002/2013JA019320.
Makridakis, S. (1993), Accuracy measures: theoretical and practical concerns. International Journal
of Forecasting, 9 (4), 527–529, doi:10.1016/0169-2070(93)90079-3.
Morley, S. K., J. P. Sullivan, M. G. Henderson, J. B. Blake, and D. N. Baker (2016), The Global
Positioning System constellation as a space weather monitor: Comparison of electron measure-
ments with Van Allen Probes data. Space Weather, 14 (2), 76–92, doi:10.1002/2015SW001339,
2015SW001339.
Reeves, G. D., S. K. Morley, R. H. W. Friedel, et al. (2011), On the relationship between relativis-
tic electron flux and solar wind velocity: Paulikas and Blake revisited. Journal of Geophysical
Research: Space Physics, 116 (A2), A02213, doi:10.1029/2010JA015735.
Swanson, D. A., J. Tayman, and C. F. Barr (2000), A note on the measurement of accuracy for
subnational demographic estimates. Demography, 37 (2), 193–201.
Tofallis, C. (2015), A better measure of relative prediction accuracy. J. Oper. Res. Soc., 66 (8),
1352–1362.
Walther, B. A. and J. L. Moore (2005), The concepts of bias, precision and accuracy, and their use
in testing the performance of species richness estimators, with a literature review of estimator
performance. Ecography, 28 (6), 815–829, doi:10.1111/j.2005.0906-7590.04112.x.
Wilks, D. S. (2006), Statistical methods in the atmospheric sciences, 2nd Edition. Academic Press.
Yu, Y., J. Koller, V. K. Jordanova, et al. (2014), Application and testing of the L* neural network
with the self-consistent magnetic field model of RAM-SCB. Journal of Geophysical Research:
Space Physics, 119 (3), 1683–1692, doi:10.1002/2013JA019350.
Zheng, Y. and D. Rosenfeld (2015), Linear relation between convective cloud base height and
updrafts and application to satellite retrievals. Geophysical Research Letters, 42 (15), 6485–6491,
doi:10.1002/2015GL064809, 2015GL064809.