Introduction

Global surface temperature over the past several decades was significantly warmer than any other decade since 1850. This temperature increase is not happening at equal rates globally, with Northeast Asia, including Mongolia and its surroundings, as one of the hotspots experiencing the strongest warming since the 1950s1. The increased anthropogenic greenhouse gases (GHG) emissions2,3,4,5 and the decreased anthropogenic aerosols (AA) outside Asia6,7 have been reported to be the main drivers of the warming around Northeast Asia. Previous studies have verified that external forcings can only explain approximately half of the multidecadal surface air temperature (SAT) variations in Northeast Asia4. Natural climate variability, such as the two Northern Hemisphere’s decadal modes—the Atlantic Multidecadal Variability (AMV)8,9 and the Interdecadal Pacific Oscillation (IPO)10,11, has also been qualitatively confirmed to play a crucial role in the SAT change over East Asia2,12. For thermodynamic process, the upward sensible and latent heat are enhanced during the AMV positive phase and further causes surface warming in the Northern Hemisphere13,14,15,16. In addition, AMV modulates the decadal changes in the atmospheric Silk Road teleconnection pattern across the mid-latitude and amplifies the warming in Northeast Asia8. The other decadal mode in the Northern Hemisphere, IPO, affects the SAT in East Asia through large-scale zonal circulation changes in the mid-latitude Asia11. IPO induces abnormal warming in Northeast Asia with an anticyclone anomaly. The warm phase of IPO can force a low-level meridional wave train stretching from the tropical northwest Pacific to East Asia, which leads to an anomalous anticyclone in Northeast Asia17. Against the backdrop of global warming, there has been abrupt warming in Northeast Asia since the 1990s2,18, which accompanies the increased frequency of summer heatwaves and the concurrent increase in severe droughts. Here, we present the quantification of the contribution of both Northern Hemisphere’s modes IPO and AMV to the abrupt warming in Northeast Asia in the 1990s.

To understand the contribution of external forcings and internal variability to SAT changes over Northeast Asia, it is crucial to precisely separate the forced and unforced responses. A popular tool to estimate the forced response is a multi-model ensemble mean (MMEM), across multi-model ensembles of the Coupled Model Intercomparison Project (CMIP)4,6. However, such multi-model ensembles cannot be used to separate unforced responses cleanly due to the differences across models arising from different in model physics or parametrizations and different model responses19,20. In contrast, single model initial-condition large ensembles (SMILEs) are powerful tools to cleanly separate the relative contribution of external forcing and internal variability21. Unlike the Atmospheric General Circulation Model simulations done before2,12, SMILEs also consider the influence of atmosphere-ocean coupled system. Furthermore, some CMIP5 or CMIP6 models which take account into cloud-aerosol interactions consistently overestimate the model response to GHG22,23 and aerosol changes24,25. The MMEM cannot totally eliminate the biases in simulating climate response, while a SMILE that can capture the observed forced signal adequately can reduce the biases to a certain extent. Estimating the contribution of external forcings and internal variability based on SMILEs that best capture the observed temperature evolution is of great importance for climate change attribution and near-future climate prediction. In this paper, we provide a robust evaluation of the SMILEs’ capacity to simulate the observed forced response in Northeast Asian SATs by applying a time series and rank frequency analysis26.

Here, we use ten different SMILEs from climate models of CMIP5 and CMIP6 generations, to provide evidence that the compound effect of AMV and external forcings contributed to most of the abrupt temperature increase in Northeast Asia in the 1990s. Furthermore, we also quantify how this warming rate will be amplified or weakened by the phase change of the AMV under prescribed anthropogenic emissions in the near future.

Results

Observed decadal warming in Northeast Asia

Since the 1980s, abrupt warming has been observed over Northeast Asia (Fig. 1a). The power spectrum of the SAT in Northeast Asia (NEASAT) is characterized by a high variance on multidecadal timescales, with a peak period of ~60 years, which is significant at the 95% confidence level (Supplementary Fig. 1). There is a sharp increase (1.21 K) in NEASAT around 1996 during 1901–2022 according to the moving Student’s t-test, consistent for different moving window selections (Supplementary Fig. 2). Statically significant warming between the interval 1997–2014 and 1980–1996 can be observed throughout East Asia, with the warming center in Northeast Asia (Fig. 1b). This pattern of uneven warming may result from human activities such as GHG2 via the ‘warmer climate – drier soil’ feedback27, or internal variability2,4,6.

Fig. 1: The decadal change of averaged surface air temperature (SAT) anomalies in Northeast Asia (NEASAT) between the interval 1997–2014 (P2) and 1980–1996 (P1) in boreal summer.
figure 1

a Time series of observed original (blue) and 9-year low-pass filtered (red) NEASAT anomalies (relative to the period 1961–1990) from 1901 to 2022 using BEST dataset. The blue horizontal lines in Fig. 1a represent the NEASAT averages for the interval 1980–1996 and 1997–2014, respectively. Units: K. b The composite difference of observed SAT averages between 1997–2014 and 1980–1996 using BEST dataset. Units: K. c The composite difference of observed SAT averages between 1997–2014 and 1980–1996 using multi-model ensemble mean (MMEM) of single model initial-condition large ensembles (SMILEs). Units: K. d The MMEM of the standard deviation (STD) of SAT change between P2 and P1 among members in a SMILE. e The NEASAT change between P2 and P1 using SMILEs, ordered by increasing Equilibrium Climate Sensitivity (ECS). Units: K. The black stippling in Fig. 1b, c denotes regional signals significant at the 95% confidence level. The gray box in Fig.1(b-d) represents Northeast Asia (NEA, 35°−52.5°N, 90°E−120°E) studied in this paper.

The relative contribution of external forcing and internal variability

The SAT variations in each grid can be considered as the superposition of the externally-forced (due to GHG, AA, natural forcing (NAT), and other external forcings) and the internally-generated variability (due to IPO, AMV, and other sources of internal variability). The forced response can be estimated as the ensemble mean across individual runs from each SMILE. The composite difference of the MMEM of externally-forced SAT between 1997–2014 and 1980–1996 shows significant warming centers over the mid-latitude of Eurasia landmass and Tibet Plateau (Fig. 1c). The forced warming results from increasing GHG2 and decreased aerosols outside Asia6. However, this simulated warming pattern based solely on the multi model forced response is inconsistent with the observed and cannot fully explain the observed warming (Fig. 1b). Therefore, internal variability may play an essential role in explaining this observed abrupt warming in Northeast Asia.

To evaluate this, it is crucial to robustly estimate the internal variability. For this, we first removed the forced response from the original variability by subtracting the ensemble mean from each member28. Then we calculated the standard deviation of internally-generated SAT difference between the interval 1997–2014 and 1980–1996 across the ensemble, representing the intensity of the internal variability. For the MMEM of the standard deviation, there is a center located on the southeast side of Lake Baikal, located east of the observed warming center (Fig. 1d). This shift in position may be attributed to the deviation of the center of the standard deviation to the east in some models, e.g., MPI-GE5, MPI-GE6, and CESM2. Most SMILEs showed consistent features, with strong internal variability centers in Northeast Asia (Supplementary Fig. 3), except for EC-Earth3 and UKESM1-0-LL. In general, internal variability is larger in Northeast Asia than in surrounding regions, which means that internal variability seems to play a critical role in the decadal variation of NEASAT.

The externally-forced SAT change, ranging from 0.46 K to 1.29 K (Fig. 1e), increases with increasing Equilibrium Climate Sensitivity (ECS). The simulated forced response varies greatly among models, so it is crucial to evaluate the ability of models to capture the observations. The observed SAT time series fluctuates around its long-term forced signal due to internal variability. Ideally, for a model that captures both forced signals and internal variability well, the observed SAT time series should occur across the whole ensemble spread of simulations with no preferred frequency, and mostly within the ensemble limits26. Not only can SMILEs separate these signals more cleanly, it is also an essential tool for evaluating model biases and determine if models really deviate from observations beyond what can be explained by internal variability or not. We used a time series and rank frequency analysis to identify possible model biases in capturing the external forcing and internal variability of NEASAT (see Methods).

For the time series analysis, the largest discrepancies occur in recent decades, with observations occurring outside the ensemble spread (Fig. 2). In general, this kind of discrepancies can be attributed to either the biases in the simulated internal variability or the forced response. Observations falling outside of the ensemble spread for the whole observational record results from the underestimation of the internal variability, or from the size of the ensemble being insufficient to fully sample this internal variability. However, if the discrepancies occur during a certain period, as it is the case here with the discrepancies occurring during the period of abrupt warming in recent years, this points to a bias in the model which fails to capture the forced response26.

Fig. 2: Time series (the first and third column) and rank histograms (the second and fourth column) of NEASAT anomalies in boreal summer (June-July-August; JJA) for SMILEs.
figure 2

(The first and third column) Black circles represent the time series of BEST observed NEASAT anomalies. Lines represent ensemble maxima and minima, and shading represents the ensemble spread within the 75th percentile bounds (12.5th to 87.5th percentiles). The numbers in the lower right corner represent the frequency with which the observations occur above the ensemble maximum (red), below the ensemble minimum (blue) and within the 75th percentile ensemble range (12.5th to 87.5th, gray). The numbers in the upper right bracket indicate the number of members. Units: K. (The second and fourth column) Rank histograms represent the frequency of each place that BEST NEASAT observations would take in a list of ensemble members ordered by ascending NEASAT values. Lines illustrate the rank histogram’s slope, as the mean rank frequency over a centered 6-bin window for BEST observations (solid lines, colors), and for the 90% confidence perfect model range as the 5th to 95th percentile range of rank slopes for each ensemble member treated as observations (dashed lines, black). Crosses represent the frequency of minimum (0) and maximum (number of members) ranks for observations (color), and the perfect model range as the 5th to 95th percentile in frequency (black). Units: %. Bin sizes are 1 rank. Anomalies are relative to the period 1850–1900.

For NEASATs, the observations are almost completely within the ensemble spread for the period of 1850−1980s (Fig. 2). Furthermore, there has been no significant change or trend in the intensity of internal variability from 1850 to 2014 (Supplementary Fig. 4). Therefore, our evaluation shows that the climate models assessed here do not systematically underestimate the internal variability in NEASAT. On the other hand, the simulated warming signal is substantially lower than the observation in some SMILEs, including MIROC6, GISS-E2-1-G, GISS-E2-1-H, and UKESM1-0-LL, with observations reaching values beyond the ensemble maxima with too high frequencies (Fig. 2). Two of these four SMILEs, MIROC6 and GISS-E2-1-G, substantially underestimate the forced warming because of the lower ECS (2.57 K and 2.72 K, respectively). 12% and 20% of observations occur above the ensemble spread in another two of the four SMILEs, GISS-E2-1-H and UKESM1-0-LL, respectively, despite relatively higher ECS (3.08 K and 5.36 K). The large uncertainty in aerosol simulation can influence this23,29. For IPSL-CM6A-LR and CanESM5, roughly 5% of observations occur below the ensemble minima respectively, indicating that these two models may overestimate the forced warming response due to higher ECS (4.56 K and 5.62 K, respectively).

We assess whether the cases where observations occur outside the ensemble spread can be attributed to internal variability using the perfect-model rank frequency test26 (see Methods). For this test, each member is treated as if it were the observations, and we evaluate how it differs from the rest of the ensemble by computing the rank histograms for each member against the new n members, with n being the total number of members of the ensemble. If observations deviate from the ensemble similarly as the ensembles individual members, then we can consider that the model is able to capture observations correctly, and the differences between observations and the model ensemble are not beyond what can be explained by internal variability.

Most of the models, including MIROC6, GISS-E2-1-G, GISS-E2-1-H, UKESM1-0-LL, and CanESM5, were found to not capture observations correctly, due to too high frequencies of rank 0 or rank n, meaning observations falling below or above the ensemble limits respectively, beyond this perfect model range (Fig. 2). Therefore, these models deviate from observations beyond what internal variability could explain and fail to capture observations correctly. For IPSL-CM6A-LR, observations fall below the ensemble minima with a frequency that could result from internal variability based on this evaluation.

Our findings show that only five SMILEs provide an adequate representation of observations for Northeast Asian SAT with rank histograms within the perfect-model range: MPI-GE5, MPI-GE6, EC-Earth3, IPSL-CM6A-LR and CESM2 (Fig. 2). The first four models have ECS (2.8 K, 3 K, 4.3 K and 4.56 K respectively) within the very likely range of 2 K to 5 K estimated in IPCC AR61, and the last model has higher ECS (5.16 K) which exceeds the very likely IPCC ECS range. Although high-ECS model (CESM2) can still reproduce observations, there may be potential discrepancies in the simulated forced signal because the very high ECS exceeds the very likely IPCC ECS1. Therefore, we base our analysis on MPI-GE5, MPI-GE6, EC-Earth3 and IPSL-CM6A-LR, the four climate models that (1) adequately capture the internal variability and forced response in observed Northeast Asian SATs and (2) exhibit ECS in line with the best-estimate IPCC ECS range.

Previous studies also state that MPI-GE5 offers the best global and regional representation of internal variability and forced response in observed historical SAT evolution26. The estimated contribution of the forced response to the NEASAT warming in the 1990s is 65% (41% to 82% for the model spread). The external forcing cannot totally explain the abrupt warming in Northeast Asia in the 1990s. Only by finding another important internally-generated factor can we fully explain the abrupt warming in Northeast Asia and thus predict the SAT change in Northeast Asia in the near future.

Impact of the AMV on NEASAT

We quantify the role that the two Northern Hemisphere’s decadal modes—the AMV and IPO — play in the abrupt warming in Northeast Asia in the 1990s. The 100 patterns of internally-generated sea surface temperature (SST) change derived from each member in MPI-GE5 were regressed onto the 100 values of internally-generated NEASAT change between 1997–2014 and 1980–1996 through the member index. The only difference among the 100 members from MPI-GE5 is internal variability. This regression method through the member index can help us identify the main internally-generated SST modes related to the NEASAT change between 1997–2014 and 1980–1996. The regression coefficient field features as an observed AMV-like pattern in the North Atlantic (80°W–0°E, 0°–60°N; Fig. 3a). The internally-generated SST difference between the interval 1997–2014 and 1980–1996 also shows similar results, with warming center in North Atlantic (Fig. 3b). In addition, there is a warm center in the North Pacific (150°E–150°W, 25°N–45°N) and a cold center in the tropical central-eastern Pacific (170°W–90°W, 10°S–10°N), respectively (Fig. 3b). This pattern resembles the IPO-like pattern. Ideally, the pre-industrial control (piControl) experiment can represent the internal variability well. The low-pass filtered NEASAT is related to the AMV index in the piControl experiment (r = 0.29, significant at the 99% confidence level with the effective degree of freedom = 139; Supplementary Fig. 5).

Fig. 3: Influence of the Atlantic Multidecadal Variability (AMV) and Interdecadal Pacific Oscillation (IPO) on the decadal change of NEASAT between 1997–2014 (P2) and 1980–1996 (P1) in boreal summer.
figure 3

a The regression pattern of the internally-generated sea surface temperature (SST; Units: K) change with respect to the internally-generated NEASAT change between 1997–2014 and 1980–1996 across the MPI-GE5 100 members through the member index. b The changes of internally-generated SST averages between 1997–2014 and 1980−1996 (Units: K). c Histograms of the frequency of occurrence of the NEASAT change between 1997–2014 and 1980−1996 (Units: %) for the 100 members of MPI-GE5 derived from the original results (light blue) and those adjusted accounting for the influence of the observed IPO phase transition (pink). The dots in Fig. 3c denote the observational NEASAT change (black) and the change for the ensemble mean without (blue) and with IPO adjustments (red). d Similar with (c), but applied AMV-adjustments. The dots in Fig. 3d denote the observational NEASAT change (black) and the change for the ensemble mean without (blue) and with AMV adjustments (red). e The contribution of IPO for different SMILEs (Units: %). f The contribution of AMV for different SMILEs (Units: %). The purple boxes in Fig. 3a, b denote the northern Atlantic (80°W–0°E, 0°–60°N), the tropical central-eastern Pacific (TCEP, 170°W–90°W, 10°S–10°N) and the northern Pacific (NP, 150°E–150°W, 25°N–45°N). The numbers in the upper right corner of (c, d) indicate the contribution of the ensemble mean without (blue) and with adjustments (red) to the observation. The numbers in the upper left bracket in (c, d) indicate the number of members.

The AMV and IPO evolutions were adjusted in each historical run of SMILEs based on observations to quantify the relative contribution (see Methods). Results are based on MPI-GE5, and the results of other models are shown in the supplementary. For example, for AMV, the composite changes of the AMV evolution between 1997–2014 and 1980−1996 in each member of SMILEs were replaced by observed changes (see Eq. 7 in Methods). Thus, the ensemble mean of NEASAT change derived from adjusted members represents the NEASAT change forced by external forcing and observed AMV evolution. Similarly, we can get the adjusted NEASAT related to external forcing and observed IPO evolution.

The increase in NEASAT based on external forcings during the interval 1997–2014 is only 0.46 K higher than during the 1980−1996 period in MPI-GE5. However, this is still substantially less than the observed warming of 1.12 K, indicating that external forcings are not sufficient to explain this warming. The probability distribution of NEASAT change shifted to a much warmer state after applying AMV- and IPO- adjustment for different SMILEs (Fig. 3c, d, Supplementary Figs. 6, 7). After IPO-adjustment in MPI-GE5, the IPO-induced increase in NEASAT enhanced the warming in Northeast Asia from 0.46 K (41% of the observed value) to 0.63 K (56% of the observed value), thus marking an IPO-contribution of 15% of the warming in NEASAT (Fig. 3c). However, the contribution of IPO varies among different models, ranging from −14% to 34% (Fig. 3e), which indicates the uncertainty in simulating the effect of IPO. For MMEM, the IPO can only explain 7% (−14% to 34% for the model spread) of the total warming in NEASAT (Fig. 3e).

The AMV-adjusted NEASAT change (0.78 K) between the interval 1997–2014 and 1980–1996 in MPI-GE5 accounts for 70% of the observed SAT change (Fig. 3d). The contribution of AMV is 29% for MPI-GE5. This significant and large contribution can also be found for other SMILEs. The AMV can explain 23% (19% to 29% for the model spread) of the total warming in NEASAT (Fig. 3f). Therefore, the influence of AMV is much larger and robust across climate models than the IPO contribution. Combined with the contribution of external forcing (65% for MMEM, 41% to 82% for the model spread), the AMV-adjusted NEASAT change can account for 88% (60% to 111% for the model spread) of the observed warming. The change of NEASAT in the near future can be projected based on the potential future phase change of AMV. It is noted that MPI-GE5 does not include runs that show NEASAT differences that are close to the observations (Fig. 3c, d). We computed the probability for a NEASAT and AMV difference in MPI-GE5 that is as large or larger than the one from observations (i.e., a p value; Supplementary Fig. 13). Changes in AMV and NEASAT in observations are less likely to occur in the MPI-GE5 (p value 0.37% and 0.46% for NEASAT and AMV change respectively). The two p values for NEASAT and AMV difference are about the same, indicating that such extreme NEASAT and AMV changes may have occurred simultaneously. We consider that the observation-based difference is outside the most extreme individual model results because the small region is indeed going to experience statistically unlikely changes due to internal variability. This further verifies the critical role of AMV in NEASAT. The other nine SMILEs all include runs that show NEASAT differences that are close to the observations (Supplementary Fig. 7). We think this is because the estimated forced responses in these nine SMILEs are stronger than in MPI-GE5 (Fig. 2, Supplementary Fig. 7), so the model can reach the observations more easily with less extreme internal variability. The AMV-triggered barotropic circumglobal teleconnection responses feature anticyclonic circulations over Northeast Asia8,30,31. This response can cause increased warm advection towards Northeast Asia and downward shortwave radiation, further amplifying the warming in Northeast Asia8,31. Notably, the negative-to-positive phase change of the AMV increased the warming by 70% in MPI-GE5. If the current emissions continue over the next several decades, the phase transition of the AMV will significantly affect the decadal projection of NEASAT.

Future change in NEASAT due to phase change of the AMV

Northeast Asia has been facing an increasing frequency of summer heat waves in the last two decades32. Superimposed on the external forcing, the negative-to-positive phase shift of AMV enhanced the decadal warming in Northeast Asia in the 1990s. Besides, the phase change of AMV will also influence the projection of NEASAT in the near future, potentially leading to more periods of amplified warming and respective impacts. The evaluation of 10 different SMILEs highlighted MPI-GE5, MPI-GE6, EC-Earth3 and IPSL-CM6A-LR as the models that best simulate the internal variability and external forcing in observations while having a realistic ECS range. Because the large ensemble of the Shared Socioeconomic Pathways version 5−8.5 (SSP5-8.5) runs of IPSL-CM6A-LR are not available, we use the left three models to investigate the influence of phase change of AMV in the future. All the three models show an adequate representation of the observed AMV index, which is well within the ensemble spread of MPI-GE5, MPI-GE6 and EC-Earth3 (Supplementary Fig. 8), denoting that these three models can reasonably simulate the intensity of AMV variations.

In the near future (next twenty years), the AMV, which is in a positive phase since the late 1990s, may shift from a positive phase to a negative or neutral due to internal variability. This behavior in the near-term future is showcased across the different members of MPI-GE5. For example, the AMV derived from the member 73 shifts to a negative AMV phase reaching minus two standard deviations below the mean value (\(A-2{STD}\); by analogy, we can get \(A-{STD}\), \(A+{STD}\) and \(A+2{STD}\)) of observation (Fig. 4a). In contrast, the AMV derived from the member 65 remains in a positive phase since the 1990s for several years into the future (Fig. 4a).

Fig. 4: Influence of AMV on the future NEASAT decadal change between 2023–2042 and 1997–2022 under the representative concentration pathway version 8.5 (RCP8.5) or Shared Socioeconomic Pathways 5-8.5 (SSP5-8.5) emissions scenario.
figure 4

a Time series of observed original AMV index (gray) and 9-year low-pass filtered AMV index (Units: K) derived from observation (dark gray), MPI-GE5 member 73 (blue) and MPI-GE5 member 65 (red). The black horizontal lines in (a) denote the average plus (minus) double standard deviation of the observed 9-year low-pass filtered AMV index (\(A\pm 2{STD}\)) and average plus (minus) standard deviation (\(A\pm {STD}\)), respectively. b Histograms of the frequency of occurrence of the NEASAT change between 2023–2042 and 1997–2022 (Units: %) for the 100 members of MPI-GE5 derived from the original results (dark gray) and those adjusted accounting for the influence of the different AMV phase transition (red for \(A+2{STD}\), orange for \(A+{STD}\), light blue for \(A-{STD}\) and blue for \(A-2{STD}\)). c Same as (b), but using MPI-GE6. d same as (b), but using EC-Earth3. The pink shading in (a) represents the spread of the 9-year low-pass filtered AMV index from MPI-GE5 100 members. The black dot in (b–d) denotes the NEASAT change for the ensemble mean without AMV adjustments. The red, orange, light blue and blue dots in (b–d) indicate the NEASAT change for the ensemble mean with different AMV adjustments (red for \(A+2{STD}\), orange for \(A+{STD}\), light blue for \(A-{STD}\) and blue for \(A-2{STD}\)), respectively.

To quantify the contribution of AMV to NEASAT temperature change in the near future, the AMV evolution was adjusted in each member of these three models based on the different phases (\(A-2{STD}\), \(A-{STD}\), \(A+{STD}\) and \(A+2{STD}\)). For \(A-{STD}\), the NEASAT warming will decrease by 23% (18% to 30% for the model spread) under the representative concentration pathway version 8.5 (RCP8.5) or SSP5-8.5 emission scenario (Fig. 4b–d). In some extreme cases, the phase intensity of the AMV could approach \(A-2{STD}\) as it did in the 1970s (Fig. 4a), which will offset externally-forced NEASAT warming by 37% (29% to 49% for the model spread) (Fig. 4b–d). When the intensity of the AMV reaches \(A+{STD}\) and \(A+2{STD}\), the NEASAT warming can be intensified by 5% (4% to 6% for the model spread) and 19% (15% to 25% for the model spread), compared with the warming caused by the external forcing under the RCP8.5 or SSP5-8.5 emission scenario (Fig. 4b–d). The offset warming will make the climate in Northeast Asia more stable. In contrast, the increased warming will make Northeast Asia have a hotter and drier climate, which will face a greater risk of heat waves and associated impacts.

Discussion

We quantify the drivers of the abrupt warming in Northeast Asia in the 1990s. Based on time series and rank frequency analysis, The AMV is a significant internally-generated factor, which contributes to 23% (19% to 29% for the model spread) of the NEASAT change in the 1990s. The AMV, combined with external forcings, explains 88% (60% to 111% for the model spread) of the observed change, which lays a foundation for adjusting projections of the NEASAT change in the near future accounting for different AMV phase changes. The remaining unexplained warming can be attributed to internal atmospheric variability8,33 and other sources34,35 (e.g., land surface conditions). Under the prescribed RCP8.5 or SSP5-8.5 emission scenario in the near future, the externally-forced NEASAT warming is projected to be intensified by 19% (15% to 25% for the model spread) and 5% (4% to 6% for the model spread) if the AMV reaches \(A+2{STD}\) and \(A+{STD}\) phases, respectively. In contrast, AMV will offset the externally-forced NEASAT warming by 37% (29% to 49% for the model spread) and 23% (18% to 30% for the model spread) if the AMV reaches \(A-2{STD}\) and \(A-{STD}\) phases.

We assess SMILEs from 10 climate models, chosen because these models have enough individual runs and long enough pre-industry control runs. We find that the contribution of external forcing to the change in NEASAT is highly uncertain across different models, ranging from 41% to 113% among the SMILEs (Fig. 1e). The ECSs of these models are relatively evenly distributed, ranging from 2.57 K to 5.62 K. This study used the time series and rank frequency analysis to evaluate the ability in capturing the external forcing and internal variability for ten SMILEs. We identify MPI-GE5, MPI-GE6, EC-Earth3 and IPSL-CM6A-LR as the models that best simulate observations with ECS values in line with the best IPCC estimates. The contribution of external forcing is estimated as 65% for MMEM (41% to 82% for the model spread). Note that some CMIP5 or CMIP6 models which take into account cloud-aerosol interactions may overestimate the model response to GHG22,23 and aerosol changes24,25. MPI-GE5 and MPI-GE6 (based on MPI-ESM1.1 and MPI-ESM1.2, respectively) have specifically tuned cloud feedbacks to better match the historical warming28. Since 1980, the aerosol effective radiative forcing has not changed substantially36. In this study, we used 10 SMILEs which have different treatments of cloud-aerosol interaction. The uncertainty in the representation of aerosol effective radiative forcing has also been taken into account when evaluating the 10 SMILEs based on time series and rank frequency analysis. Therefore, the results in this study are based on a robust evaluation of models that sample a wide range of different model behaviors, and highlight the importance of carefully choosing climate models that can adequately represent both the external forced signal and internal variability in observations.

The influence of AMV on decadal changes in NEASAT is highlighted in this paper. If the AMV turns to a negative phase, the warming caused by the external forcing will be largely offset. If the AMV remains positive or even intensifies, Northeast Asia is expected to experience warmer climates and more heatwave occurrences. The thermal conditions in Northeast Asia will also affect the climate of East Asia through non-local response18,37,38, which will have an important impact on East Asia’s production, life and economic development. Improving the interdecadal forecast of AMV is necessary for government decision-making and risk management. Northeast Asia will continue to warm in the near future under the RCP8.5 or SSP5-8.5 emission scenario, regardless of how the AMV changes. The only thing we can do is to adopt strong climate policies and achieve carbon neutrality by cutting emissions as soon as possible.

Methods

Observation data

Monthly SST data from 1901 to 2022 was obtained from the Hadley Center Global Sea Ice and Sea Surface Temperature (HadlSST) dataset39. Observed monthly SAT data were obtained from Berkeley Earth Surface Temperature (BEST) dataset40 for the interval 1901–2022 and Climatic Research Unit Time-Series version 4.06 (CRUTS) for the interval 1901–2021 datast41.

Single model initial-condition large ensembles (SMILEs)

SMILE experiments include a large number of members of a single model forced by the same historical external forcings but starting from different initial conditions. The simulations from SMILEs only differ in internal variability. This experimental design allows a quantification for externally-forced and internally-generated variability28,42,43. This paper mainly obtained the results using the 100-member Max Planck Institute for Meteorology Grand Ensemble of CMIP5 generations (MPI-GE5) generated by the Max Planck Institute Earth System Model version 1.1 (MPI-ESM1-1)28. Besides, we include other SMILEs from a broad range of climate models of CMIP6 generations: GISS-E2-1-G, GISS-E2-1-H, MIROC6, Max Planck Institute for Meteorology Grand Ensemble of CMIP6 generations (MPI-GE6), EC-Earth3, IPSL-CM6A-LR, UKESM1-0-LL, CESM2 and CanESM5 (see Supplementary Table 1). We chose these ten modes because only these ten modes have more than 15 individual historical runs and pre-industry control (Pi-control) experiments. The MPI-GE5 (CMIP6) historical simulations covering the period 1901–2005 (1901–2014) are forced by all historical forcing agents. The MPI-GE5 historical simulations were extended to 2014 following the RCP8.5 emissions scenario. The MPI-GE5 and CMIP6 historical runs were not extended to 2022 because historical individual runs in some CMIP6 models are not in one-to-one correspondence with runs in the Shared Socioeconomic Pathways version 5-8.5. All simulations were interpolated onto a 1.875° × 1.875° grid by bilinear interpolation.

Time series and rank frequency analysis

The time series and rank frequency analysis19,26,44,45 is used to identify the potential biases in simulating the external forcing and internal variability based on the metric: whether the ensemble spread adequately captures the long-term forced signal as well as the possible deviation from the forced signal caused by internal variability. The climatological reference period was determined as 1850–1900 to identify the forced response more robustly. Three main indicators are highlighted to describe the distribution of observation across the ensemble: the frequency with which the observations occur above the ensemble maximum, below the ensemble minimum and within the 75th percentile ensemble range (12.5% to 87.5%).

For the rank frequency analysis, we first calculated the rank with which the observed NEASAT occurs in a list of NEASAT ensemble in ascending order for each year. For rank 0, the observed NEASAT in a given year occurred below all the NEASAT simulated by all the ensemble members. For rank n, with n the number of the members in a SMILE, the observed NEASAT in a given year is higher than all simulated values. We further calculated the frequency with which the observed NEASAT takes place in each rank. For a model that can adequately capture the observation, the observation should take place at each rank without any preference, so the rank frequency histogram is relatively flat. If the observations take place outside of the ensemble spread in time series, this kind of discrepancies can be attributed to either the simulation biases in internal variability or the forced response. The perfect-model rank frequency method test was introduced to assess whether the cases where observations occur outside the ensemble spread can be attributed to internal variability. Each member is treated as if it were observation, and compute the rank frequency histograms for each member against the new n members. Therefore, the perfect-model rank frequency range (90% confidence range, 5th–95th percentiles) can be obtained. If the observations exceed this range at any rank, we can determine that the distribution in the SMILE deviates from the observed distribution which can’t be explained by the internal variability. Lastly, we can identify the models with a better capacity for simulating the external forcing and internal variability and further capture the observation more adequately.

Separating externally-forced and internally-generated variability in the SMILEs

This paper used SMILEs to investigate the impact of internal variability and external forcing. Take MPI-GE5 for example. Since the same external forcing forces the 100 members of MPI-GE5, the differences between the 100 members only differ in the initial conditions, which reflects differences in internal variability. This kind of difference can be smoothed out after averaging over 100 members (hereafter referred to as “ensemble mean”). The ensemble mean \({(T\left(n,i\right))}_{{EM}}\) of SAT or SST \(T\left(n,i\right)\) at a certain grid can be taken as the model’s response to external forcing (\({T}_{{EV}}\), externally-forced variability)46:

$${T}_{{EV}}(n)={(T\left(n,i\right))}_{{EM}}$$
(1)

where \(i\) refers to the number of members for a SMILE, and \(n\) represents the year.

Thus, we can get the unforced variability \({T}_{{IV}}(n,i)\) of member \(i\) in MPI-GE5:

$${T}_{{IV}}\left(n,i\right)=T\left(n,i\right)-{T}_{{EV}}(n)=T\left(n,i\right)-{(T\left(n,i\right))}_{{EM}}$$
(2)

Separating externally-forced and internally-generated variability in the observed SSTs

The SST \(T\left(n,i\right)\) in the observation was also separated into externally-forced and internally-generated variability at each grid \(i\) for year \(n\). Using the global-mean SST time series from the ensemble mean simulations is a preferred way to define the forced signal, which has the similar temporal evolution at all grids because it is determined by the external forcing series47. The forced response at each grid point was estimated as the product of the regression coefficients and forced signal (i.e., rescaled global-mean SST time series from the ensemble mean simulations). Take MPI-GE5 for example. We used the rescaled global-mean SST \(\bar{{T}_{m}}\left(n\right)\), including the area north of 60°S and south of 75°N, from the ensemble mean of individual runs from MPI-GE5 as the first-order estimate of the forced response \({T}_{{EV}}\left(n,i\right)\), which can be estimated through linear regression48:

$$T\left(n,i\right)={b}_{{EV}}(i)\bar{{T}_{m}}\left(n\right)+\varepsilon$$
(3)
$${T}_{{EV}}\left(n,i\right)={b}_{{EV}}\bar{{T}_{m}}\left(n\right)$$
(4)

The forced response \({T}_{{EV}}\left(n,i\right)\) was subtracted from the observation at each grid point to produce the residual field, which primarily contains internal variability (internally-generated variability).

Definition of AMV and IPO

IPO and AMV in observation

Firstly, we removed the linear trend from the observation. IPO index is defined as the 9-year low-pass filtered time series of difference in the detrended SST anomaly between the tropical central-eastern Pacific (TCEP, 170°W–90°W, 10°S–10°N) and the northern Pacific (NP, 150°E–150°W, 25°N–45°N)46,49,50. For AMV index, previous studies have shown that both internal variability and external forcing (e.g., decadal changes in GHG and aerosols) have contributed to the AMV51. Conversely, the IPO index is insensitive to the definition selection52. We firstly calculated global mean SST excluding the Northern Atlantic (80°W–0°E, 0°–60°N) as the gSSTmA index. The AMV index is defined as the 9-year low-pass filtered time series of area-averaged SST over the Northern Atlantic after 9-year low-pass filtered gSSTmA index has been subtracted53.

IPO and AMV in each run of SMILEs

We subtracted the ensemble mean of SSTs from each run at each grid point in the SMILEs to produce the unforced fields. IPO index is defined as the 9-year low-pass filtered time series of difference in the unforced SSTs between TCEP and NP. The AMV index is defined as the 9-year low-pass filtered time series of area-averaged unforced SSTs over the Northern Atlantic (80°W–0°E, 0°–60°N).

Adjustment of SAT based on observed AMV and IPO

Previous studies have applied an “adjustment method” to investigate the climate response to the IPO or AMV50,54. Take AMV for example. In this paper, the original simulated AMV-related SAT change between the interval 1997–2014 and 1980–1996 in each individual run was replaced with that caused by the observed phase change of AMV:

$$\Delta {T}_{{adjust}}\left(i\right)=\Delta {T}_{{EV}}\left(i\right)+\Delta {T}_{{IV}{\rm{\_}}{adjust}}\left(i\right)$$
(5)

where \(i\) represents the number of members for a SMILE, \(\Delta {T}_{{adjust}}\left(i\right)\) represents the AMV-adjusted SAT response for member \(i\), \(\Delta {T}_{{EV}}\left(i\right)\) refers to the SAT response to external forcing for member \(i\) and \(\Delta {T}_{{IV\_adjust}}\left(i\right)\) represents the SAT response to internal variability after applied with AMV-adjustment for member \(i\). \(\Delta {T}_{{IV\_adjust}}\left(i\right)\) at a grid is given by:

$$\Delta {T}_{{IV}{\rm{\_}}{adjust}}\left(i\right)=\Delta {T}_{{IV}{\rm{\_}}{orignial}}\left(i\right)+\varepsilon \left(i\right)$$
(6)

where \(\Delta {T}_{{IV\_orignial}}\left(i\right)\) represents the original simulated SAT response to internal variability for member \(i\). \(\varepsilon \left(i\right)\) is the adjustment term, which can be estimated by:

$$\varepsilon \left(i\right)=-{r}_{T,{AMO}}\left(\Delta {AMO}\left(i\right)-\Delta {{AMO}}_{{obs}}\right)$$
(7)

where \(\Delta {AMO}\left(i\right)\) and \(\Delta {{AMO}}_{{obs}}\) represent the AMV phase change between the interval 1997–2014 and 1980–1996 for member \(i\) in a SMILE and observation. \({r}_{T,{AMO}}\) represents the regression coefficient of the SAT regressed onto the 41-year low-pass filtered AMV from the Pi-control simulation. The main reason for choosing 41-year low-pass filtered AMV is as follows: we would like to obtain the intensity of the SAT response to the phase transition of the AMV, which shows multidecadal variability over 40 years (Supplementary Figs. 1, 9). NEASAT and AMV are closely related only on the multidecadal scale (r = 0.95, p < 0.01; Supplementary Fig. 10). The response of SAT to the original AMV (which wasn’t applied with low-pass filtering) in Northeast Asia is negative, inconsistent with the observed results (Fig. 1c, Supplementary Fig. 11b). The northern Atlantic SST is closely connected with NEASAT via atmospheric circulations over interannual timescales55,56,57,58,59. Because of the interdecadal mutations studied in this work, the interannual variability must be eliminated by low-pass filtering. After applying a 41-year low-pass filtering, the AMV positively correlates with SAT in Northeast Asia, consistent with the observed results (Fig. 1c, Supplementary Fig. 12b).

Statistical analysis

A composite analysis was used to calculate SAT decadal changes between 1997–2014 and 1980–1996. A moving Student’s t test analysis was used to detect the mutational point of NEASAT60,61,62. A Student’s t test was also used to test the significance of the composite differences and linear regression.