Main

The North Atlantic Oscillation (NAO)1 is the leading mode of atmospheric circulation variability in the North Atlantic, reflecting changes in the pressure gradient between the Azores High and the Iceland Low. In its positive phase, an increased pressure gradient drives stronger mid-latitude westerly winds, increased storminess, a poleward shift of the North Atlantic jet stream and storm track, warm and wet conditions in Northern Europe and southeastern North America, and cold and dry conditions in northeastern North America and Southern Europe2,3. Conditions are opposite during the negative NAO phase. Hence, the NAO severely impacts society, including through water security4, flooding5, mortality due to cold weather6, transport7, energy demand8 and supply9, structural damage from storms10 and economic losses11.

Understanding how the NAO will change in future decades is key for developing effective adaptation measures. However, there are three major challenges to overcome. First, model simulations of the NAO are highly chaotic, such that tiny changes that would be impossible to measure can lead to opposite trends12. If this is true for the real world, the future NAO will be highly uncertain due to irreducible internal variability13. However, there is mounting evidence that the real-world NAO is much more predictable than models suggest14,15,16,17,18,19,20,21. This model error has been called the ‘signal-to-noise paradox’ (SNP) because a climate model can predict the real world better than one of its own ensemble members despite perfectly representing itself22. This arises when the ratio of the predictable signal to unpredictable noise is too small in models. Consequently, the magnitude of the modelled ensemble mean is too small and must be inflated to obtain realistic and reliable predictions15,18,22. Although the causes of the SNP are currently unknown, there is evidence that weak atmospheric eddy feedback23,24 and/or errors in ocean–atmosphere interactions25,26 play a role. Both of these would be expected to affect all timescales, including climate projections for which there is already some evidence20,26,27. Hence, in this study, we tested for the SNP and made appropriate adjustments.

Second, simulations of atmospheric circulation depend on the model used, leading to large uncertainties28,29,30. However, model differences can potentially be exploited to estimate the real world using an emergent constraint31 if a robust physical relationship exists between model differences in projected changes and model differences in something that can be observed. The resulting regression enables uncertainties in future projections to be narrowed through a weighted model average, where the weights depend on how well each model represents the observed quantity32. Developing such a constraint was a major aim of this study.

Third, observed changes are at the limit or even beyond those simulated by climate models and the drivers of past multi-decadal changes in North Atlantic climate are very poorly understood. Of over 300 simulations from the latest Coupled Model Intercomparison Project (CMIP6), none captured the observed strengthening of the North Atlantic jet over the 70-year period 1951–2020, and only two simulated an NAO increase as large as that observed33. This is consistent with a wider and persistent inability of climate models to simulate observed multi-decadal changes in the North Atlantic, including jet speed and the NAO34,35,36,37, sea-level pressure in winter and summer38, sea surface temperatures39,40, ocean circulation41, and connections between sea surface temperatures and atmospheric circulation42. These discrepancies could arise if the observed variations are due to extremely rare internal variability. However, here we investigated the alternate possibility—that climate models do not reproduce the correct regional response to external forcings.

Opposite model responses

We began by investigating the response to natural forcings from volcanic eruptions and solar irradiance, which are simpler to interpret than simulations with all forcings, and can drive North Atlantic multi-decadal variability43,44,45,46,47,48. We used CMIP6 simulations covering the historical period 1850–2020 that were forced only by volcanic eruptions and solar variations (hereafter, hist-nat). We used all models with at least three ensemble members (Methods and Table 1) and diagnosed the forced response from the ensemble mean. Earth’s energy imbalance (EEI, the net imbalance at the top of the atmosphere between incoming solar and outgoing shortwave and longwave radiation) is a fundamental driver of climate change because the climate system will adjust to bring EEI back to equilibrium. To assess the atmospheric circulation response, we regressed the ensemble mean EEI for each model against its zonal mean zonal wind (\(\overline{u}\), where the overbar denotes zonal mean), focusing on 31-year rolling means for the extended boreal winter (October to March). We found similarities, but also key differences, between models and illustrate these for CanESM5 and CNRM-CM6-1 (Fig. 1a,b). For a positive EEI, both models show a horseshoe-shaped band of increased \(\overline{u}\) extending across the tropical tropopause around 100 hPa and down into the extratropical troposphere in both hemispheres. However, the latitude where increased \(\overline{u}\) reaches the surface was fundamentally different, such that CanESM5 shows a poleward shift of the climatological mid-latitude westerly winds (grey contours in Fig. 1), whereas CNRM-CM6-1 shows an equatorward shift. Consequently, the \(\overline{u}\) response to EEI is significantly opposite in these two models in the mid-latitudes (50–65° N) (stippling in Fig. 1a,b). Furthermore, time series of the NAO are oppositely correlated with observations for these models (Fig. 2a) (r = 0.67 for CanESM5, r = −0.48 for CNRM-CM6-1), consistent with a poleward (equatorward) shift of the jet during positive (negative) NAO.

Table 1 Model simulations and ensemble sizes
Fig. 1: Dynamical and thermal responses to Earth’s energy imbalance.
figure 1

a,b, Regression between boreal winter (October–March) 31-year EEI and zonal mean zonal wind (ms−1 per W m−2) from hist-nat simulations for CanESM5 (a) and CNRM-CM6-1 (b). c,d, Regression between boreal winter (October–March) 31-year EEI and zonal mean temperature (K per W m−2) for CanESM5 (c) and CNRM-CM6-1 (d). Grey contours show climatological (1985–2014) zonal mean zonal wind (contour interval 5 ms−1, negative values dashed). Stippling shows where the two models have significantly opposite signs (Methods). Yellow lines show the climatological hygropause defined by the specific humidity contour 10−5 kg kg−1 and yellow circles show the 200-hPa hygropause latitude, diagnosed as being where this specific humidity intersects the 200-hPa pressure level in the Northern Hemisphere. Magenta circles show the climatological jet latitude at 200 hPa (Methods). Note the nonlinear colour scale. All time series were linearly detrended before computing the regressions.

Fig. 2: Explaining model differences.
figure 2

a, Time series of detrended observed NAO (black curve) and ensemble mean hist-nat simulations for selected models (with at least 10 ensemble members). The correlation with observations and the RPS (if significant at the 5% level) (Methods) are indicated. b, Scatter plot of detrended hist-nat Northern Hemisphere 200-hPa jet latitude shift related to EEI against climatological 200-hPa hygropause latitude relative to the jet (Methods). The correlation is indicated (sample size = 16), along with the line of best fit (black) and the observed 200-hPa tropopause latitude relative to the jet (vertical grey line). Whiskers show 5–95% confidence intervals (Methods). c, Time series of observed NAO (black curves) and hist-nat multi-model mean simulations (yellow curve, ensemble mean solid curve, 5–95% confidence interval diagnosed from ensemble members shaded). The correlation between simulated and observed NAO is indicated. d, As for c but for calibrated hist-nat simulations based on the emergent constraint shown in b and the ratio of predictable signals (RPS), with shading showing the 5–95% confidence interval diagnosed from the root mean square error (r.m.s.e.) (Methods). All data are for boreal winter (October–March) 31-year rolling means. Significance is based on a two-sided t test in b and block bootstrapping (Methods) in a, c and d.

Explaining model differences

To understand the cause of these differences between models, we regressed EEI onto zonal mean temperature (\(\overline{T}\)) (Fig. 1c,d). Both models show a warm troposphere and cool stratosphere associated with a positive EEI. This creates an anomalous meridional temperature gradient (\(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\), where ϕ is latitude) around the tropopause (for example, 200 hPa) that would be expected to increase \(\overline{u}\) (refs. 49,50,51,52,53). Crucially, the latitude of the transition from warmer troposphere to cooler stratosphere at 200 hPa is fundamentally different in these two models, being poleward of the climatological jet latitude (the magenta circle shows the jet centroid) (Methods) in CanESM5, consistent with a poleward shift in \(\overline{u}\), as seen in simplified models54, but equatorward in CNRM-CM6-1, consistent with an equatorward shift in \(\overline{u}\).

Water vapour is Earth’s most potent greenhouse gas (GHG)55, raising global temperatures by absorbing infrared radiation throughout the troposphere and amplifying global warming via strong positive feedback56. Water vapour decreases sharply in the stratosphere57, and the decreased opacity allows emitted radiation to escape to space causing cooling58,59. A warm troposphere and cool stratosphere response to EEI (Fig. 1c,d) is consistent with this sharp gradient in water vapour. Moreover, climate models show large variations in water vapour60 that affect dynamical responses to global warming61. To investigate whether water vapour could explain the different model responses to EEI, we took the climatological (1985–2014) specific humidity contour 10−5 kg kg−1 (yellow curves in Fig. 1c,d) as the transition between the troposphere and stratosphere (hereafter, the hygropause) and found that this intersects 200 hPa (yellow circles) poleward and equatorward of the jet latitude (magenta circles) in CanESM5 and CNRM-CM6-1, respectively, consistent with their opposite jet shifts. Furthermore, we found a significant relationship (r = 0.62, P = 0.01) (Fig. 2b) across all 16 hist-nat model simulations (Table 1) between the hygropause latitude relative to the jet (\(\Delta {\phi }_{j}^{h}\), where h refers to the hygropause latitude and j to the jet latitude) and the jet shift related to EEI (Methods).

Exploiting model differences

All hist-nat simulations underestimate \(\Delta {\phi }_{j}^{h}\) compared to observations (Fig. 2b, vertical grey line), consistent with a general underestimation of stratospheric water vapour60. This suggests (1) the real-world jet would shift poleward in response to a positive EEI and (2) it would shift further than any of the models. Given its climatological nature and significant correlation with the response, \(\Delta {\phi }_{j}^{h}\) can be used as an emergent constraint to define weights for each model32, such that models that are close to the observed value receive larger weights and those that shift the jet in the wrong direction (that is, equatorward for positive EEI) can receive negative weights (Methods). The resulting time series is more significantly correlated with observations (r = 0.42, P = 0.01) (Fig. 2d) than the unweighted multi-model ensemble mean (r = 0.24, P = 0.11) (Fig. 2c). However, the variability of the constrained time series is too small relative to the correlation skill, consistent with the SNP. We therefore multiplied by the ratio of predictable signals between observations and models (RPS, 4.4)18 and computed uncertainties from the root mean squared error (r.m.s.e.) to achieve the calibrated time series (Fig. 2d) (Methods).

The significant correlation (Fig. 2d) suggests a role for natural forcings, but the minimum in 1980 is later than the observed minimum (1950), suggesting other factors are also important. We investigated whether the same emergent constraint applies to historical simulations that include all forcings (Table 1). Again, we found a range of NAO responses (Fig. 3a), with some models significantly correlated with observations (MIROC6, r = 0.75, P = 0.00; UKESM1-0-LL, r = 0.62, P = 0.00). We linearly detrended (Methods) to highlight decadal variability and found a significant correlation across the 25 models between \(\Delta {\phi }_{j}^{h}\) and the jet shift related to EEI (r = 0.45, P = 0.02) (Fig. 3b), showing that \(\Delta {\phi }_{j}^{h}\) also explains some of the model differences when all forcings are included. The resulting calibrated time series (Methods) is significantly correlated with observations (r = 0.64, P = 0.00) in contrast to the unweighted multi-model mean (Fig. 3c,d). Similar to hist-nat, the variability of the constrained time series is too small relative to the correlation skill, consistent with the SNP, and scaling by the RPS (4.4) is required for realistic and reliable simulations.

Fig. 3: Exploiting model differences to improve projections.
figure 3

a, Time series of observed NAO (black curve, Hadley Centre Sea Level Pressure (HadSLP)) and ensemble mean historical and SSP2-4.5 simulations for selected models. The correlation with observations and the RPS (if significant at the 5% level) are indicated. b, Scatter plot of detrended historical northern hemisphere 200-hPa jet latitude shift related to EEI against climatological 200-hPa hygropause latitude relative to the jet (Methods). The correlation is indicated (sample size = 25) along with the line of best fit (black) and the observed 200-hPa tropopause latitude relative to the jet (vertical grey line). Whiskers show 5–95% confidence intervals (Methods). All time series have been linearly detrended for this panel only. This panel uses all historical simulations (Table 1). c, Time series of observed NAO (black curves: thick, HadSLP dataset; thin, Twentieth Century Reanalysis (20CR) dataset) and historical and projected (with scenarios SSP5-8.5, SSP2-4.5 and SSP1-2.6) multi-model simulations (colours, multi-model ensemble mean solid curve, 5–95% confidence interval diagnosed from ensemble members shaded). The correlation between simulated and observed NAO is indicated. d, As for c but for calibrated simulations based on the emergent constraint shown in b and the RPS, with shading showing the 5–95% confidence interval diagnosed from the r.m.s.e. (Methods). All data are for boreal winter (October–March) 31-year rolling means. Significance is based on a two-sided t test in b and block bootstrapping (Methods) in the other panels. For the historical period, only ensemble members for historical + SSP2-4.5 (Table 1) are shown leading to discontinuities at the start of future projections for the other scenarios.

Implications for future projections

Our goal is to provide improved projections of the NAO. We considered three emissions scenarios (Methods)—SSP5-8.5, representing high emissions, SSP2-4.5, with medium mitigation, and SSP1-2.6 for strong mitigation. Even for a single scenario (SSP2-4.5), the uncertainties are huge (Fig. 3a), with some models projecting a strong increase (GISS-E2-1-G, IPSL-CM6A-LR, CanESM5), others projecting a decrease (EC-Earth3, UKESM1-0-LL) and some showing little change (MIROC6, MIROC-ES2L, ACCESS-ESM1-5). As noted previously62, unweighted multi-model means show a slight negative trend under SSP1-2.6, a modest positive trend for SSP2-4.5 and a larger positive trend for SSP5-8.5 towards values near the upper range of the historical observations by the end of the century (Fig. 3c). However, calibrated projections suggest a much larger increase under SSP5-8.5 to unprecedented values around three times larger than those in the historical record (Fig. 3d). A strong impact of mitigation is also clear, with the NAO decreasing from mid-century under SSP2-4.5 and earlier under SSP1-2.6.

Physical mechanism

To gain further confidence in the clear differences in calibrated projections (Fig. 3d), we investigated the physical mechanism. Latitude–year Hovmuller plots of rolling 31-year \(\overline{T}\) anomalies at 200 hPa (Fig. 4a) for hist-nat show tropical cooling following major volcanic eruptions (Krakatau in 1883, Agung in 1963, El Chichon in 1982 and Pinatubo in 1991). This weakens \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\) and shifts the jet equatorward54, resulting in a tendency towards a negative NAO (Fig. 2d). Conversely, tropical warming, due to the lack of volcanic eruptions in the first half of the twentieth and twenty-first centuries, strengthens \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\) and increases the NAO. In hist-nat, the peak cooling occurs around 1990 due to the combined eruptions of Agung, El Chichon and Pinatubo. However, GHG warming throughout the twentieth century results in an earlier minimum in both \(\overline{T}\) and NAO (Fig. 3d) in the historical simulations. Under SSP5-8.5, the tropics continue to warm, whereas mitigation in SSP2-4.5 and especially SSP1-2.6 decreases tropical \(\overline{T}\). The resulting trends in \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\) explain the main features of the constrained time series (Fig. 4b,c), although there are probably other processes63,64,65,66,67 that should be investigated in the future.

Fig. 4: Physical mechanism.
figure 4

a, Latitude–year Hovmuller plots of zonal mean temperature at 200 hPa relative to the preceding 30-year mean, shown for rolling 31-year means. Panels are shown for the historical, hist-nat, SSP1-2.6, SSP2-4.5 and SSP5-8.5 unweighted multi-model ensemble means. Vertical lines indicate 35° N. b, Time series of calibrated hist-nat NAO (thin yellow line, left axis) and multi-model ensemble mean 31-year rolling trend of \(-\frac{\partial \overline{T}}{\partial \phi }\) at 35° N (thick yellow line, right axis, minus sign because \(\frac{\partial \overline{T}}{\partial \phi }\) is negative in the Northern Hemisphere). c, As for b but for historical simulations and future projections. All data are for boreal winter (October–March).

Discussion

We propose that some of the large model spread in multi-decadal NAO simulations can be explained by differences in background water vapour, which controls the latitude of upper tropospheric heating. An emergent constraint, based on the climatological hygropause latitude relative to the jet, reveals a strong impact of external forcing on multidecadal NAO variability in contrast to the unweighted model mean68. Emergent constraints have sometimes been shown to fail when applied to updated models69,70,71,72. This potentially indicates overfitting and non-robustness, although it might be expected because emergent constraints potentially motivate model improvements by highlighting errors. Either way, emergent constraints revert to the unweighted multi-model mean if they do not explain any of the model spread32. Nevertheless, it is important to test for robustness31. Our constraint is underpinned by the thermal wind balance, a basic physical principle, and supported by idealized model experiments54. Furthermore, it applies to multiple simulations (hist-nat, historical and three projections, and other out-of-sample simulations) (Extended Data Figs. 1 and 2). In particular, simulations in which carbon dioxide (CO2) is increased by 1% per year show an equatorward shift of the jet for models whose \(\Delta {\phi }_{j}^{h}\) is negative and a poleward shift for models whose \(\Delta {\phi }_{j}^{h}\) is positive (Extended Data Fig. 1b,c), suggesting that our constraint will apply to future increases in GHGs. Moreover, unlike constraints on future projection, we also show that our constraint improves correlations with historical observations.

Our results suggest that multi-decadal variations in the NAO are largely driven by trends in the upper troposphere meridional temperature gradient \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\), similar to the effect on upper-level winds found in earlier studies73. Major volcanic eruptions cool the tropical upper troposphere, weakening \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\) and resulting in a tendency towards negative NAO, whereas GHGs warm the tropical upper troposphere and cool the stratosphere, strengthening \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\) and resulting in a tendency towards positive NAO. Although volcanoes and GHGs dominated the period since 1850, solar variations can also affect \(\scriptstyle\frac{\partial \overline{T}}{\partial \phi }\)47,48 and could be important in other periods74. We stress that our emergent constraint only explains some of the model differences and that other drivers could also be important63,64,65,66,67 and should be investigated in the future.

Our constrained time series required scaling by the RPS (4.4) to achieve realistic and self-consistent values, consistent with the SNP18,22. Despite uncertainties in the RPS (Methods), our calibrated projections suggest that, under the high-emissions scenario (SSP5-8.5) and in the absence of major volcanic eruptions, the multi-decadal NAO will increase to unprecedented levels by 2100, 1.4 to 3.7 times greater than any in the historical period. This will probably cause severe impacts on society, especially increased flooding and storm damage in Northern Europe. However, we found a clear capability for mitigation to avoid such impacts. There is substantial and mounting evidence that models suffer from the SNP14,15,16,17,18,19,20,21,22,24,26,27. Until this error is fixed, large ensembles will be key in quantifying the forced response75. Given the systematically different model responses shown here and in numerous previous studies28,29,30,76,77, the models cannot all be correct, and they must ultimately be improved to achieve reliable projections. The water vapour errors highlighted here potentially provide a step forward, but the SNP remains to be resolved. However, taking model output at face value and seeking model consensus may underestimate the magnitude of future regional climate change. Alternative recalibration approaches, such as those proposed here, are needed to prepare society for the full range of possible extremes.

Methods

Models simulations

We used all CMIP6 (ref. 78) simulations available at the time that provided at least three ensemble members for historical (all forcings, 1850 to 2014) and projection scenarios following the medium mitigation79 SSP2-4.5 scenario (2015 to 2100), giving 25 models (Table 1). For the same models, we also used projections following the high-emissions (SSP5-8.5) and strong mitigation (SSP1-2.6) scenarios. To assess the impact of natural solar and volcanic forcings, we used the hist-nat simulations from the Detection and Attribution Model Intercomparison Project (DAMIP)80 with at least three ensemble members, giving 16 models. The ensemble sizes for some models were increased following the Large Ensemble Single Forcing Model Intercomparison Project (LESFMIP)75.

For each model, we assessed the forced signal as the ensemble mean (the unweighted average of all ensemble members) and assessed rolling 31-year boreal winter (October to March) means. Indices were calculated using the original model grids, and fields were interpolated to a common 3.75° longitude by 2.5° latitude grid. We assessed data for the following pressure levels: 10, 50, 70, 100, 150, 200, 250, 300, 500, 700, 850 and 925 hPa.

Observations

Mean sea-level pressure observations were taken from HadSLP2 (ref. 81) and the 20CR (v.3)82. Water vapour observations were taken from the Stratospheric Water and Ozone Satellite Homogenized (SWOOSH) database83. Given its central role in our emergent constraint, uncertainties in water vapour would ideally be assessed using multiple datasets. However, there are serious biases in reanalysis estimates of water vapour84,85, highlighting the need for additional observations.

Indices

The NAO index is calculated as the difference in mean sea-level pressure between two small boxes located around the Azores (28–20° W, 36–40° N) and Iceland (25–16° W, 63–70° N), with the average over the period 1850–2014 removed to create anomalies16.

Hygropause latitude relative to the jet

We calculated jet latitude at 200 hPa using a centroid measure of the sum of the latitude weighted by the square of the westerly zonal mean wind86:

$${\phi }_{j}=\frac{\mathop{\int}\nolimits_{20}^{70}\phi {\overline{u}}^{\,2}{\mathrm{d}}\phi }{\mathop{\int}\nolimits_{20}^{70}{\overline{u}}^{\,2}{\mathrm{d}}\phi }$$
(1)

where Ï• is latitude and \(\overline{u}\) is zonal mean zonal wind.

We define the hygropause latitude (ϕh) as the latitude in the Northern Hemisphere at which the climatological (1985–2014) water vapour contour 10−5 kg kg−1 intersects the 200-hPa pressure level. Note that we do not use the hygropause defined as the water vapour minimum because this does not necessarily capture the water vapour heating and cooling effects that are important here.

The hygropause latitude relative to the jet is then:

$$\Delta {\phi }_{j}^{h}={\phi }_{h}-{\phi }_{j}$$
(2)

Jet shifts related to EEI

To obtain jet shifts related to EEI, we calculated ϕj using climatological \(\overline{u}\) (1985–2014) plus the regression of \(\overline{u}\) onto EEI, and removed ϕj, obtained from climatological \(\overline{u}\).

Calibration

We performed a two-step calibration:

  1. (1)

    Exploit model differences using an emergent constraint.

  2. (2)

    Correct the resulting time series for signal-to-noise errors.

Our emergent constraint based on the hygropause latitude relative to the jet (the x axes in Figs. 2b and 3b) allowed an ensemble regression approach32 and defined weights for each model, such that models close to the observed value generally received high weights and those that shifted the jet in the wrong direction could receive negative weights. Note that the weights only depend on the x axes in Figs. 2b and 3b and are given by:

$${w}_{m}=f({x}_{m}-\overline{x})+1/M$$
(3)
$$f=\frac{{x}_{{\mathrm{ob}}}-\overline{x}}{{\Sigma }_{m = 1}^{M}{({x}_{m}-\overline{x})}^{2}}$$
(4)

where wm is the weight for model m, M is the total number of models, xm is the x axis value (that is, \(\Delta {\phi }_{j}^{h}\)) for the model, \(m,\overline{x}\) is the average xm over all models, and xob is the observed x axis value.

Where the SNP22 occurs, the magnitude of the constrained time series will be inconsistent with the variance that can be explained by its correlation with the observations. Furthermore, a multi-model average can have a muted response if model differences are partially cancelled. In either case, scaling by the RPS between observations and models18 is required to achieve realistic and reliable predictions and projections. This scaling is equivalent to regressing the constrained time series against observations87. The possibility of internal variability aligning with the forced response cannot be ruled out, but this would rely on coincidence. Given the substantial evidence for the SNP and the potential impacts of unprecedented future NAO, it would be prudent for society to prepare for this possibility.

To focus on multi-decadal variability, the regressions are based on linearly detrended time series. We also linearly detrended each NAO time series (using the trend over the historical period) before applying the emergent constraint, and calculated the RPS using linearly detrended observations. We then scaled by the RPS and added the unweighted multi-model ensemble mean trend (projected to 2100) to obtain the final calibrated NAO. For each scenario, we used the same models for the historical period and projections. By using the multi-model ensemble mean trend, we captured the common thermodynamic influences (for example, heat low responses) on the NAO. It should be noted that detrending does not affect the model weights, which only depend on the hygropause latitude relative to the jet.

To sample observational uncertainty71, we took the average RPS, computed using both HadSLP and 20CR. Uncertainties (shaded in Figs. 2d and 3d) were computed from the r.m.s.e. between the calibrated time series and the two observations, as required for reliable forecasts18,88. We used the maximum r.m.s.e. from HadSLP and 20CR to provide conservative uncertainty estimates.

The RPS is uncertain, but significantly greater than one (central estimate 4.4, 5−95% confidence interval, 1.8–7.6 for the historical simulations). Using this range of RPS for the high-emissions scenario, we estimated that the 31-year NAO will increase to between 3.7 and 10.1 hPa by the end of the century. The maximum observed NAO is 1.9 and 2.7 hPa in HadSLP and 20CR, respectively. Hence, we estimate the NAO will increase to unprecedented levels at least 1.4 to 3.7 times larger than previously observed in the instrumental record.

Significance

We tested for values that were unlikely to be accounted for by uncertainties arising from a finite ensemble size (E) and a finite number of validation points (N). This was achieved using a non-parametric block bootstrap approach18,89,90, in which an additional 1,000 samples were created, as follows:

  1. (1)

    Randomly sample with replacement N validation cases. To take autocorrelation into account, this is done in blocks of 15 consecutive cases.

  2. (2)

    For each of these, randomly sample with replacement E ensemble members.

  3. (3)

    Compute the required statistic for the ensemble mean (for example, regression, correlation, RPS).

  4. (4)

    Repeat step (1) 1,000 times to create a probability distribution.

  5. (5)

    Obtain the significance level based on a two-tailed test of the hypothesis that skill is zero, or RPS is one.

Step (2) was omitted for the constrained time series, which is a single realization. Subsampling tests, using models with large ensembles, show that regression uncertainties are underestimated by small ensembles that do not sample a wide range of outcomes. We therefore plotted 5–95% regression uncertainty averaged over all models, with at least 50 ensemble members. In Fig. 1, stippling shows where the models have opposite signs and their 5–95% uncertainties do not overlap.