
Understanding the Host Network

Midhul Vuppalapati (Cornell University), Saksham Agarwal (Cornell University), Henry N. Schuh (University of Washington),
Baris Kasikci (University of Washington), Arvind Krishnamurthy (University of Washington), Rachit Agarwal (Cornell University)
ABSTRACT

The host network integrates processor, memory, and peripheral interconnects to enable data transfer within the host. Several recent studies from production datacenters show that contention within the host network can have significant impact on end-to-end application performance. The goal of this paper is to build an in-depth understanding of such contention within the host network.

We present domain-by-domain credit-based flow control, a conceptual abstraction to study the host network. We show that the host network performs flow control over different domains (sub-networks within the host network). Different applications may traverse different domains, and may thus observe different performance degradation upon contention within the host network. Exploring the host network from this lens allows us to (1) near-precisely explain contention within the host network and its impact on networked applications observed in previous studies; and (2) discover new, previously unreported, regimes of contention within the host network.

More broadly, our study establishes that contention within the host network is not merely due to limited host network resources but rather due to the poor interplay between processor, memory, and peripheral interconnects within the host network. Moreover, contention within the host network has implications that are more far-reaching than the context of networked applications considered in previous studies: all our observations hold even when all applications are contained within a single host.

CCS CONCEPTS

• Hardware → Networking hardware; • Networks → Network performance analysis; Network servers.

KEYWORDS

Host network, host architecture, performance analysis

ACM Reference Format:
Midhul Vuppalapati, Saksham Agarwal, Henry N. Schuh, Baris Kasikci, Arvind Krishnamurthy, and Rachit Agarwal. 2024. Understanding the Host Network. In ACM SIGCOMM 2024 Conference (ACM SIGCOMM '24), August 4–8, 2024, Sydney, NSW, Australia. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3651890.3672271

ACM SIGCOMM '24, August 4–8, 2024, Sydney, NSW, Australia. © 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 979-8-4007-0614-1/24/08.

1 INTRODUCTION

In conversations on "networks", our community usually engages in discussions on the Internet, datacenter networks, mobile networks, etc. This paper is about a different network—the host network—that integrates processor, memory, and peripheral interconnects to enable data transfer between devices (processors, memory, network interface cards, storage devices, accelerators, etc.) within a host. Several studies from large-scale production datacenters [1, 42, 44] have demonstrated that contention within the host network can result in significant throughput degradation, tail latency inflation, and isolation violation for networked applications. As eloquently argued in [1, 2, 10, 42], the host network is becoming an increasingly prominent bottleneck due to unfavorable technology trends: the performance of peripheral interconnects is improving much more rapidly than that of processor and memory interconnects, resulting in an increasing imbalance of resources and contention within the host network. Designing future network protocols, operating systems, and hardware requires an in-depth understanding of the various regimes and root causes of such contention within the host network.

Processor, memory, and peripheral interconnects have been studied for decades in the computer architecture community [16, 23, 31, 36, 37, 47–54, 62–64]; however, these works primarily focus on the behavior of individual interconnects rather than the interplay between these interconnects that leads to contention within the host network. Recent work from the computer networking community [1, 2, 42, 44, 55] studies the impact of contention within the host network on the behavior of end-to-end network protocols (e.g., packet queueing and drops at the host), rather than characterizing the root causes of contention within the host network. Thus, our understanding of the host network—especially the interplay between processor, memory, and peripheral interconnects that leads to contention within the host network—is rudimentary at best. The goal of this paper is to advance this status quo.

The key idea that drives our study is domain-by-domain credit-based flow control, a conceptual abstraction to study the host network. We demonstrate that the host network can be decomposed into multiple "domains" (sub-networks of the host network)¹, each of which uses an independent credit-based flow control mechanism. Specifically, the sender of each domain is assigned credits that are used to limit the amount of data the sender can inject into the domain; the sender consumes a credit to send one message, and this credit is replenished when the message receipt is acknowledged by the receiver of the domain. Different domains within the host network have different numbers of credits and different unloaded latencies.

¹The notion of a domain here is different from administrative domains in the Internet architecture, and in ATM networks [4, 13, 20, 21]. Importantly, unlike administrative domains, the domains in our study do not need to be non-overlapping.
Each data transfer, depending on the source (compute or peripheral device) and on the type (read or write), traverses a different set of domains. The end-to-end performance of each transfer depends on the number of credits and the per-request latency of the domains traversed by that transfer. Many details of existing host network hardware are not public; nevertheless, we reverse engineer the Intel host architecture to characterize each domain, its credits, and its unloaded latency.

The lens of domain-by-domain credit-based flow control enables us to capture the subtle interplay between processor, memory, and peripheral interconnects that leads to nanosecond-scale latency inflation and host resource underutilization in certain domains. This, along with the knowledge of domains traversed by each data transfer, allows us to near-precisely explain how nanosecond-scale inefficiencies within the host network percolate through the host hardware, operating systems, and network protocols to negatively impact application-level performance. We do this both for applications generating peripheral-to-memory traffic (referred to as P2M apps, e.g., networked and storage applications [11, 14, 61, 70, 72]) and for applications generating compute-to-memory traffic (referred to as C2M apps, e.g., in-memory databases [17, 19, 57, 60] and systems for graph analytics [24, 40, 45]). More concretely:

• We reproduce the phenomenon of contention within the host network and its impact on networked applications observed in previous studies. We do this for networked applications using both in-kernel [2] and hardware-offloaded RoCE/PFC [42, 44] transport protocols. We find that networked applications (P2M apps in our case) can indeed suffer from performance degradation, e.g., when both C2M apps and networked applications are doing memory writes. We provide precise root causes for the phenomenon. We also extend observations made in all prior studies [1, 2, 42, 44]: we demonstrate (and provide an explanation for) degradation in C2M app performance along with networked app performance.

• We identify new, previously unreported, regimes of contention within the host network: we find that—in sharp contrast to the phenomenon observed in previous studies [1, 2, 42, 44]—P2M apps can, in fact, cause severe performance degradation for C2M apps for most workloads, with minimal or no impact on P2M app performance. For instance, we observe that when C2M apps are colocated with P2M apps performing storage operations, the C2M app suffers from 1.2−2× performance degradation, with no impact on P2M app performance. These new regimes are reproducible across multiple generations of servers with different processors, different memory bandwidth to core count ratios, and different configurations (e.g., with and without direct cache access [28], with and without prefetching, etc.).

Our study suggests that contention within the host network has implications that are more far-reaching than the context of networked applications considered in previous studies [1, 2, 42, 44]. In particular, all our observations hold even when all applications are contained within a single host (e.g., using storage applications that generate P2M traffic using locally-attached storage devices). Thus, our work may be of independent interest to researchers and practitioners not only in computer networking but also in operating systems and computer architecture.

The code, along with the documentation necessary to reproduce our results, is available at https://github.com/host-architecture/understanding-the-host-network.

Table 1: Hardware configuration of our two servers. All of the specifications are for a single socket in each server. DRAM and PCIe bandwidths are theoretical maximum values.

            Ice Lake                Cascade Lake
CPU         Xeon Platinum 8362      Xeon Gold 6234
Cores       32 @ 2.8GHz             8 @ 3.3GHz
LLC         48MB                    24MB
DRAM        4× 3200MHz DDR4         2× 2933MHz DDR4
DRAM BW     102.4GB/s               46.9GB/s
PCIe        8× PM173X NVMe          4× P5800X NVMe
PCIe BW     32GB/s                  16GB/s

2 HOST NETWORK CONTENTION REGIMES

In this section, we broadly characterize the interplay of processor, memory, and peripheral interconnects within the host network using four "quadrants" (§2.2) that reveal different regimes in terms of contention within the host network and performance degradation for C2M and P2M apps. Our key findings are:

• The first regime, referred to as the blue regime, captures a new phenomenon: C2M apps observe performance degradation, while P2M apps observe minimal or no performance degradation. Surprisingly, this phenomenon can happen even when memory bandwidth is far from saturated.

• The other regime, referred to as the red regime, captures the phenomenon observed in previous studies [1, 2, 42, 44]: P2M apps observe severe performance degradation when memory bandwidth gets saturated. In addition, we find that C2M apps also observe significant performance degradation.

This section focuses on characterizing the host contention regimes; we discuss the root causes in §5. We first focus on a setup where all traffic is contained within the host (P2M traffic generated by locally attached storage devices)—this allows us to isolate the impact of contention within the host network from the impact of network protocol behavior (packet drops, queueing delays, and/or PFC pause frames) on application performance. We then discuss how our observations generalize to networked applications in §2.3.

We use two testbeds with different processors and different resource ratios (Table 1). The first testbed uses Intel Ice Lake processors and has roughly the same resource ratios (cores, memory bandwidth, and PCIe bandwidth) as the testbed in the Google study [1]. The second testbed uses Intel Cascade Lake processors and has a lower core to memory bandwidth ratio. We run our experiments on a single socket. Each DRAM module is attached through a separate channel, and a simple sequential read microbenchmark saturates more than 90% of the theoretical maximum memory bandwidth.

2.1 Host network contention with real applications

We first present the new phenomenon: C2M apps observing performance degradation, with minimal or no performance degradation for P2M apps. This phenomenon is reproducible for a variety of configurations: multiple C2M apps with different compute-to-memory bandwidth demands, multiple server configurations, with and without Intel Data Direct I/O (DDIO) [28] technology, and with and without prefetching.
Figure 1: A new phenomenon of contention within the host network: C2M and P2M apps are colocated, C2M app performance degrades while P2M app performance is unaffected. This happens even though cores are isolated and memory bandwidth is far from saturated. (a–d, left–right) (a, b) Performance degradation observed by C2M and P2M when they are colocated (ratio of isolated throughput and colocated throughput for each data point; degradation for GAPBS is the slowdown—ratio of colocated execution time to isolated execution time); (c, d) Memory bandwidth utilization when C2M and P2M are colocated, broken down by C2M and P2M.

Figure 2: Enabling DDIO can worsen performance degradation for both C2M and P2M applications when the working set size does not fit in cache. (a–d, left–right) (a, b) Performance degradation for C2M and P2M when they are colocated with DDIO on/off; (c, d) Memory bandwidth utilization when C2M and P2M are colocated with DDIO on/off.

C2M and P2M apps used in our experiments. We use two C2M apps, each with different compute-to-memory bandwidth demands and different access patterns. The first C2M app is a popular in-memory database called Redis [57], and the second C2M app is a standard graph processing framework called the GAP Benchmark Suite (GAPBS) [8]. GAPBS is more memory bandwidth intensive and performs lighter-weight computations than Redis.

For Redis, we use the standard sharding-based multi-core deployment setup [12]—multiple independent Redis server instances (each with its own keyspace) running on a dedicated set of cores. Clients run on a different set of dedicated cores (1 client core per server core) and issue queries to the server instances using Unix domain sockets (the most efficient inter-process communication mechanism supported by Redis [58]). We use the standard YCSB-C (100% read) workload with a uniform random access pattern; as is standard, clients issue queries with parallelism given by the knee-point of the latency-throughput curve. Performance is measured in terms of throughput (queries/sec). The working set size per server core is 1 million key-value pairs with a 1KB value size, exceeding the system Last-Level Cache (LLC) even for a single server core. As a result, the observed cache miss ratio is >95%, and a large number of C2M memory reads are generated. For GAPBS, we run the PageRank workload on a random graph of 2²⁵ nodes and degree 16, using the GAPBS default parameters. Performance is measured in terms of execution time (lower is better). A single graph instance is shared across all the cores. The workload has a ∼5GB memory footprint, significantly larger than the cache, resulting in a large number of random C2M memory reads. We focus on non-cache-resident memory-intensive workloads since the phenomenon was observed in datacenters for similar workloads [1, 42].

We use a lightweight storage-based P2M app, FIO [7], that performs storage accesses with minimal computational overhead. We configure it to perform sequential reads with 8MB request sizes. This is representative of storage workloads that perform large sequential operations, for example, storage nodes of distributed data stores [61] for analytics workloads. The performance metric is throughput measured in IOPS. The reads result in direct memory access (DMA) writes to host memory from the storage device, leading to a large volume of P2M write traffic. While DDIO minimizes P2M traffic for many workloads by servicing DMA requests from the cache instead of memory, it is well known that it is not effective for all workloads [10, 18, 66]. Our P2M workload is in the latter category—due to the large sequential requests, the application buffers do not fit into the small portion of the cache that DDIO is allowed to use [18], thus leading to cache misses and evictions/writebacks for every DMA in steady state. Therefore, we observe nearly the same average memory bandwidth utilization for this workload with/without DDIO.

In the following experiments, we first run each of the C2M and P2M apps in isolation. We then colocate them and measure the performance degradation for each app. We start with the Ice Lake setup (Table 1). We partition cores between the applications by pinning each to a separate set of cores—we dedicate 4 cores to the P2M app (which is more than sufficient to saturate PCIe bandwidth without compute being a bottleneck) and run the C2M app on the remaining cores, varying the number of cores. We enable DDIO and hardware prefetching on this setup.

C2M app performance degrades even when memory bandwidth is far from saturated. When Redis (C2M) and FIO (P2M) are colocated, as shown in Figure 1(a), Redis observes throughput degradation (1.25−1.32×) while FIO remains unaffected. The surprising observation here is that degradation is observed even though cores and PCIe bandwidth are isolated across the applications and memory bandwidth utilization is far from saturation, as shown in Figure 1(c) (ranging between 33−53% of the theoretical maximum, with the utilization curve not yet flattened out).
Figure 3: Blue and red regimes across four quadrants (quadrants 1–4 are shown on top-left, top-right, bottom-left, bottom-right respectively). Quadrants are shaded with the color of the regime they show. For each quadrant, the left column shows the throughput degradation observed by C2M and P2M (ratio of throughput when run in isolation to the throughput when colocated) and the right column shows the memory bandwidth utilization when they are colocated, broken down by C2M and P2M traffic.
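For concreteness, the degradation metrics used in the figure captions above can be written down directly. The sketch below restates those definitions; the numbers are illustrative, not measured values.

```python
# Sketch of the degradation metrics defined in the captions of Figures 1 and 3:
# for throughput-oriented apps, degradation = isolated / colocated throughput;
# for GAPBS (execution time), slowdown = colocated / isolated time.
def throughput_degradation(isolated_tput, colocated_tput):
    return isolated_tput / colocated_tput

def slowdown(colocated_time_s, isolated_time_s):
    return colocated_time_s / isolated_time_s

print(throughput_degradation(1_000_000, 800_000))  # 1.25x degradation (illustrative)
print(slowdown(19.8, 10.0))                          # 1.98x slowdown (illustrative)
```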

Although the LLC is shared between the applications, it does not play much of a role in determining performance, since both C2M and P2M traffic observe nearly 100% cache miss ratio even when they are run in isolation. We investigate the root cause of this performance degradation in §5.

Using GAPBS rather than Redis results in the same high-level observation—C2M performance degrades (1.28−1.98×) while P2M is unaffected (Figure 1(b)). This again happens even when memory bandwidth is far from saturated (as shown in Figure 1(d) for fewer than 15 GAPBS cores). As one would intuitively expect, the magnitude of performance degradation is larger for GAPBS compared to Redis since it is more memory intensive—Redis spends only a part of its time stalled on memory accesses, while GAPBS is stalled on memory accesses nearly all of the time.

Impact of prefetching, processor generations, and resource ratios. Given the random access nature of the Redis and GAPBS workloads, hardware prefetchers have little to no impact on performance. For both workloads, we found <5% difference in performance when comparing prefetch on/off configurations in both isolated and colocated cases. We repeat the above experiments on the Cascade Lake setup (Table 1). Here, we dedicate 2 cores to FIO and run the C2M app on the remaining cores. The corresponding results are shown in the (DDIO on) curves of Figure 2. We see the same general observation as in the Ice Lake setup—C2M app performance degrades while the P2M app observes no degradation, thus showing that our observations apply across different processor generations and resource ratios. Similar observations apply even when using different read/write ratios for the C2M and P2M applications (results presented in [68]).

DDIO can worsen performance degradation when the app working set size does not fit in cache. As discussed previously, DDIO is not effective for our P2M workload and does not reduce its memory bandwidth utilization when run in isolation. To study if DDIO has any second-order impact when C2M/P2M apps are colocated, we re-ran the above experiments with DDIO disabled on our Cascade Lake setup (this was not possible on our Ice Lake setup because DDIO is permanently enabled there). The corresponding results, shown in Figure 2, reveal a surprising observation: DDIO results in worse performance degradation for C2M apps for both Redis and GAPBS (Figures 2(a), 2(b)). This is surprising because our C2M workloads already have ∼100% cache miss ratio even when run in isolation, and thus should ideally not be impacted by cache evictions caused by DDIO. We do not know how to explain this observation.

2.2 The Blue and Red Regimes

Our experiments in §2.1 use real open-sourced apps that have fixed memory access patterns. We now switch to using lightweight apps with easy-to-control memory access patterns, enabling us to perform a deeper characterization of performance degradation trends and to study different combinations of read/write for C2M/P2M apps.

Workloads. To generate C2M traffic, we use a modified version of the STREAM [46] benchmark that supports different read/write ratios. We use two C2M workloads: (1) a read-only workload that sequentially reads data from a 1GB buffer (using 64-byte AVX512 load instructions); this results in 100% memory reads (C2M-Read); and (2) a write workload that sequentially writes data to a 1GB buffer (using 64-byte AVX512 store instructions); this generates 50% read and 50% write memory traffic, since every cacheline is first read into the CPU's cache before the store instruction can be serviced and is later written back to memory during cache eviction (C2M-ReadWrite). For P2M, we run FIO with (1) 100% storage reads, which translates to 100% memory writes (P2M-Write), and (2) 100% storage writes, which translates to 100% memory reads (P2M-Read). Both our C2M and P2M workloads perform sequential accesses.

We run experiments on our Cascade Lake setup while colocating each of the two C2M workloads with each of the two P2M workloads. This results in a total of four scenarios, which we refer to as the four "quadrants". We disable prefetching and DDIO for better explicability. We found that while enabling each of these leads to different absolute degradation numbers, the trends and takeaways remain the same (when memory bandwidth is not saturated, prefetching improves C2M throughput in both the isolated and colocated cases, but their ratio remains roughly the same). The degradation observed in the quadrants is shown in Figure 3. We classify the observations into two key regimes:
Blue regime: C2M throughput degrades while P2M throughput does not. In quadrant 1 (C2M-Read, P2M-Write), we observe 1.2−1.7× degradation in C2M throughput while P2M throughput remains unaffected. This degradation happens when memory bandwidth is far from saturated (for example, with a single C2M core), and increasing load leads to worse C2M throughput. Similarly, in quadrant 2 (C2M-Read, P2M-Read), while C2M throughput degrades, P2M throughput remains unaffected; the C2M throughput degradation is lower than in quadrant 1. Quadrant 4 (C2M-ReadWrite, P2M-Read) observes the same trend as quadrant 2.

Red regime: Both C2M and P2M throughput degrade. Quadrant 3 (C2M-ReadWrite, P2M-Write) shows a range of different performance degradation trends. With 2 or fewer C2M cores, similar to quadrants 1 and 2, C2M throughput degrades while P2M throughput does not. For 3 C2M cores and above, once memory bandwidth is saturated, we see a completely different trend—C2M traffic now antagonizes P2M by getting an increasingly larger share of the memory bandwidth with increasing load, leading to larger throughput degradation for P2M traffic compared to C2M traffic. This captures the observations reported by recent works [1, 42]. P2M traffic, however, does not get starved at higher load—with 5 and 6 C2M cores, we observe a relative stabilization of the memory bandwidth shares of C2M and P2M traffic.

2.3 Networking Case Studies

Our characterization of host contention regimes in §2.2 generalizes to cases where P2M traffic is generated by a NIC instead of local storage devices, and networked transfers use either kernel-based or hardware-offloaded transport mechanisms. We briefly summarize these observations below; full details are presented in [68].

RDMA. Using RDMA over Converged Ethernet with Priority Flow Control (RoCE/PFC), we observe the same blue and red regime trends from §2.2 for each of the C2M/P2M read/write combinations—RoCE/PFC throughput degrades in the red regime; in the blue regime, on the other hand, RoCE/PFC throughput remains unaffected while causing significant C2M app throughput degradation.

DCTCP. With Linux DataCenter TCP (DCTCP) over a lossy fabric, we again observe the same blue and red regimes, although the observed application-level performance trends are slightly different. In particular, the networked application observes performance degradation in each regime. This is because, in addition to P2M traffic, the networked application also generates C2M traffic due to the data copy between application buffers and kernel socket buffers. In the blue regime, C2M throughput degradation slows down data copy processing, resulting in a CPU bottleneck; this causes DCTCP flow control to kick in, reducing the P2M traffic load. In the red regime, P2M throughput degrades but no congestion signal is sent back to the sender until packets are dropped at the NIC; this results in further throughput degradation, latency inflation, and violation of isolation properties as outlined in [1, 2].

Given that both setups—P2M traffic from within the host, and P2M traffic from datacenter network transfers—lead to similar observations, we focus on the former setup for the rest of the paper, as it makes our results easier to describe. We present corresponding results for the RDMA and DCTCP scenarios in [68].

Figure 4: Host network architecture, and C2M and P2M datapaths. Details in §3.

3 BACKGROUND

In this section, we provide a brief primer on the host network architecture and the datapath for C2M and P2M requests.

Figure 4 shows the host architecture: it consists of cores (with private L1/L2 caches), the LLC, the Caching and Home Agent (CHA), the Integrated IO controller (IIO), and the Memory Controller (MC), all connected by the on-chip processor interconnect. The CHA abstracts away the LLC and memory from the rest of the system while maintaining cache coherence². Peripheral devices are attached to the IIO through the peripheral interconnect (typically PCIe). DRAM consists of a set of modules (Dual Inline Memory Modules, or DIMMs), each of which is attached to the MC through a memory channel. These memory channels constitute the memory interconnect. For simplicity, in Figure 4, we show a single module attached through a single memory channel.

C2M datapath. CPU-to-memory reads are generated by cores upon a cache miss through the following datapath:

1. Upon an L1 cache miss, an entry is allocated in the core's Line Fill Buffer (LFB), and a request is sent to the L2 cache.
2. Upon reaching the L2 cache, the cacheline is either served from the L2 cache upon a cache hit (and the LFB entry is freed) or the request is sent to the CHA upon a cache miss.
3. The CHA serves the cacheline from the LLC if there is an LLC hit (and the LFB entry is freed). Otherwise, the CHA sends the request to the MC, where it is queued in the Read Pending Queue (RPQ).
4. The MC fetches the required cacheline from DRAM over the memory channel and returns the data back to the core, while populating the caches and ultimately freeing the corresponding LFB entry.

Memory writes are generated upon cache evictions and follow a similar datapath: a write-back from the L2 cache is sent to the CHA, which either services it from the LLC or sends a write request to the MC, where it is queued in the Write Pending Queue (WPQ). The MC eventually issues the write to DRAM over the memory channel. Importantly, unlike reads, writes generated by cores are asynchronous: the CPU only has to wait for the request to be admitted to the CHA, and the CHA only has to wait for the request to be admitted to the WPQ.

²Both the CHA and LLC are physically distributed into multiple slices. Based on the physical address, memory requests are routed to the correct slice. For simplicity, we represent the CHA/LLC as a single logical entity.
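A minimal sketch of the C2M read datapath just described may help make the step-by-step flow concrete. The hit/miss outcome of each level is an input here (not a cache model), and the strings are only descriptive labels.

```python
# Sketch of the C2M read datapath described above: a miss walks
# L1 -> LFB allocation -> L2 -> CHA/LLC -> MC/DRAM, and the LFB entry
# (the domain credit) is freed only once data is returned.
def c2m_read_path(l2_hit=False, llc_hit=False):
    path = ["L1 miss: allocate LFB entry", "send request to L2"]
    if l2_hit:
        return path + ["L2 hit: data returned, LFB entry freed"]
    path.append("L2 miss: send request to CHA")
    if llc_hit:
        return path + ["LLC hit: data returned, LFB entry freed"]
    return path + ["LLC miss: enqueue in MC RPQ",
                   "DRAM access over the memory channel",
                   "data returned, caches filled, LFB entry freed"]

print(c2m_read_path(l2_hit=False, llc_hit=False))
```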
P2M datapath. Each peripheral-to-memory request (read/write) incurs the following datapath:

1. The peripheral device initiates a DMA request to the IIO, which allocates an entry in the IIO (read/write) buffer per cacheline.
2. The IIO forwards the requests (at cacheline granularity) to the CHA. If DDIO is enabled and there is a cache hit, the CHA serves the request from the LLC (and the IIO buffer entry is freed). Otherwise, the CHA sends the request to the MC, where it is enqueued in the RPQ/WPQ.
3. The MC serves read/write requests in a manner similar to the C2M datapath. After a read is serviced from DRAM, the data is returned to the IIO, at which point the IIO buffer entry is cleared and the data is sent back to the peripheral device. For writes, the IIO only needs to wait until the request is admitted to the WPQ before freeing its buffer entry.

The interconnects within the host physically implement hop-by-hop flow control mechanisms to ensure losslessness [29]. In the peripheral interconnect, this is implemented through the exchange of PCIe credits between the peripheral device and the IIO [2, 54]. The peripheral device needs a PCIe credit to send a request to the IIO; this credit is replenished once the corresponding IIO buffer entry is freed. In the memory interconnect, flow control happens implicitly through DRAM timing constraints [35, 41]. In the processor interconnect, implementation details of credit exchange are not public.

DRAM operation. The MC reads/writes cachelines from/to DRAM over memory channels. Each memory channel can only transmit data in one direction (either reads or writes) at any point in time. The MC, therefore, operates in two separate modes, read mode and write mode, and maintains separate queues for reads and writes (RPQ and WPQ, respectively) per memory channel. Due to electrical constraints, switching between modes takes a certain delay (called the switching delay) during which the channel is idle [41]. The data in each DRAM module is organized into multiple banks. Each bank has multiple rows, each of which stores a fixed number of cachelines, and a row buffer that can buffer a single row at any time. In order to access a cacheline, its corresponding row needs to be present in the corresponding bank's row buffer. If not, this results in a row miss, which incurs additional processing delay at the banks: the row needs to be loaded into the row buffer using an Activate (ACT) operation. If the row buffer contains a different row (i.e., a row conflict), it needs to be flushed using a Precharge (PRE) operation before the new row can be loaded, which incurs additional overhead.

For the remainder of the paper, we focus on the scenario where C2M/P2M requests result in misses at all levels of the cache hierarchy (as is the case in the §2 experiments).

4 UNDERSTANDING THE HOST NETWORK

This section presents domain-by-domain credit-based flow control, a conceptual abstraction that captures the interplay between the processor, memory, and peripheral interconnects within the host network. We start by describing the abstraction along with the various domains and their characteristics in §4.1. We then describe in §4.2 how we reverse engineered the Intel host architecture to characterize each domain, its credits, and its unloaded latency.

Figure 5: Domain-by-domain credit-based flow control in the host network: different shaded regions within each sub-figure depict independent domains within the host network for the respective C2M and P2M read/write datapaths. Data is transmitted between the CPU/peripheral and the DRAM by traversing each domain using an independent credit-based flow control mechanism. The specific domain highlighted in color for each datapath is particularly interesting: these are the domains that will turn out to be the bottleneck within individual datapaths. Different domains can span different numbers of hops, leading to different domain latencies. Further, the number of domain credits (limited by the node marked in yellow) is also different for P2M vs C2M domains.

4.1 Domain-by-domain credit-based flow control

We begin by defining domain-by-domain credit-based flow control. The host network is logically decomposed into multiple domains, each of which is a sub-network of the host network. Each domain uses an independent credit-based flow control mechanism. Specifically, the sender of each domain is assigned credits that are used to limit the number of in-flight requests that the sender can inject into the domain; the sender consumes a credit to send one request, and this credit is replenished when the request is acknowledged by the receiver of the domain. Depending on the number of credits, at any given point of time, there can be multiple concurrent in-flight requests within each domain.

Intuitively, domain-by-domain credit-based flow control generalizes the two flow control mechanisms studied in the classical computer networking literature. On the one hand, end-to-end flow control mechanisms (e.g., used in TCP and in receiver-driven datacenter transport protocols [9, 22, 25, 26]) are a special case where the entire path between a sender-receiver pair is a single domain. On the other hand, hop-by-hop credit-based flow control mechanisms (e.g., used in ATM networks [32, 38, 39] and PFC-enabled RDMA networks [43, 73]) are a special case where each hop along the path between the sender-receiver pair is a domain.

Different domains within the host network have different numbers of credits and different unloaded latencies. Each request, depending on the source (compute or peripheral device) and on the type (read or write), traverses a different set of domains. Figure 5 shows the domains within the host network for each of the C2M and P2M read/write datapaths, with cores, peripheral devices, and DRAM as endpoints and intermediate components (i.e., LFB, IIO, CHA, and MC) as network nodes. We now discuss the four domains that turn out to be the most important ones (in that these will be the "bottleneck" domains in individual datapaths):
• C2M-Read Domain: This domain spans all hops from LFB to DRAM. For each request, a credit is allocated at the LFB and replenished once the request is serviced by DRAM and the data is returned to the LFB.

• P2M-Read Domain: This domain spans all hops from IIO to DRAM. For each request, a credit is allocated at the IIO and replenished once the request is serviced by DRAM and the data is returned to the IIO.

• C2M-Write Domain: This domain spans only a single hop, from LFB to CHA. For each request, a credit is allocated at the LFB and replenished once the request reaches the CHA.

• P2M-Write Domain: This domain spans two hops, from IIO to MC. For each request, a credit is allocated at the IIO and replenished once the request reaches the MC.

The maximum throughput (T) for any domain is bound by T ≤ (C × 64) / L, where C is a constant representing the hardware-specific number of credits available to the sender in the domain (in terms of cachelines), 64 is the cacheline size in bytes, and L is a variable representing the latency required to traverse all hops within the domain. Each of these factors can be different depending on the domain:

Domain Credits (C). The number of credits for the C2M-Read and C2M-Write domains is limited by the LFB size. The number of credits for the P2M-Write domain is limited by the IIO write buffer size. The number of credits for the P2M-Read domain is limited by the IIO read buffer size. For our servers, these numbers are 10−12, ∼92, and >164 cachelines, respectively.

Domain Latency (L). Different domains span different subsets of network hops; this could result in different domains having different latencies for two reasons. First, simply due to spanning a different subset of network hops, different domains may have different unloaded latencies. Second, when C2M and P2M traffic contend for host network resources, queueing at the contention point may have a different impact on different domains (only those domains that contain the contention point are impacted). As a result, contention within the host network may result in latency inflation for some domains but not others.

Given the number of credits and the latency for a domain, the maximum throughput of that domain is given by the expression T ≤ (C × 64) / L, as discussed above. The overall end-to-end throughput of a particular C2M or P2M app is the minimum throughput across all domains along the datapath for that app.

4.2 Evidence on domains and their characteristics

We now present evidence for the domains and their characteristics, including a discussion of how we reverse-engineered several of the details by piecing together information from processor manuals [29, 30] and conducting careful measurements.

Measurement Methodology. We use the Intel uncore performance monitoring counters [29] to capture average queue/buffer occupancy (O) and average request arrival rate (R) metrics at different nodes in the host network. In particular, we program the counters so that their values are aggregated in hardware every clock cycle and sample them at runtime in software every 1 second, which entails very low overhead. To compute average latency, we apply Little's law on the measured O and R values (L = O/R). We use the umask and opcode filtering capabilities of the CHA counters to classify requests based on their source (CPU/peripheral) and type (read/write), allowing us to capture all the above metrics on a per-domain basis.

C2M-Read. The C2M-Read domain spans all hops from LFB to DRAM because an LFB entry (and corresponding credit), once allocated, is only freed (and the corresponding credit replenished) once the memory read request is serviced from DRAM and returned to the core, to prevent duplicate memory requests to the same cacheline [30, 67]. To validate this, we perform latency measurements while running the C2M-Read workload (§2.2) with a varying number of cores. Figure 6(a) shows the measured LFB latency (time between allocation and replenishment of an LFB credit) alongside the CHA→DRAM read latency (time taken for a request to traverse from CHA to DRAM and for the response to return to the CHA). As is evident from the figure, the LFB latency is always strictly greater than the CHA→DRAM read latency. Further, the inflation in LFB latency from 1 to 6 cores near-perfectly matches the inflation in CHA→DRAM read latency. This shows that the LFB latency is inclusive of the CHA→DRAM read latency, thus providing evidence that the C2M-Read domain includes all hops until DRAM. In all of our experiments, the maximum measured LFB occupancy is between 10−12, providing evidence that this is the number of domain credits (also corroborated in [15]). The unloaded domain latency is ∼70ns, as is evident from the single-core data point in Figure 6(a).

C2M-Write. It is clear that the C2M-Write domain includes the LFB and the CHA (since each domain must span at least two nodes), and that it does not include DRAM (since writes are serviced to DRAM asynchronously [29]). The key challenge lies in determining whether the MC is part of the domain. To do so, we perform latency measurements while running the C2M-ReadWrite workload (§2.2) with a varying number of cores. For this workload, the LFB latency is equal to the sum of the C2M-Read and C2M-Write domain latencies (and thus must be strictly greater than each of them). If the C2M-Write domain included the MC, then the C2M-Write latency (and consequently the LFB latency) would always be strictly greater than the CHA→MC write latency (time taken for the request to traverse from CHA to MC). However, as shown in Figure 6(b), the CHA→MC write latency can exceed the LFB latency (e.g., with 6 C2M cores), thus implying that the C2M-Write domain does not include the MC. Subtracting the unloaded C2M-Read domain latency from the LFB latency at the single-core data point in Figure 6(b) gives us an estimate of ∼10ns unloaded latency for the C2M-Write domain.

P2M-Write. To understand the P2M-Write domain, we run a low-load P2M workload performing 4KB storage read requests with a queue depth of 1, colocated with the C2M-ReadWrite workload. Figure 6(c) shows the IIO latency (time between credit allocation and replenishment at the IIO) alongside the CHA→MC write latency. We make three observations. First, the unloaded domain latency is ∼300ns. Second, the IIO latency is always larger than the CHA→MC write latency. Finally, the inflation in IIO latency with increasing load near-perfectly matches the inflation in CHA→MC write latency (Figure 6(d)), indicating that the IIO latency is inclusive of the CHA→MC write latency. This provides evidence that, unlike the C2M-Write domain, the P2M-Write domain includes the MC. To determine the P2M-Write domain credits, we run the P2M-Write workload from §2.2 (which saturates PCIe bandwidth) and apply the maximum possible C2M load. We find that the IIO write buffer occupancy saturates at ∼92, giving us the size of the IIO write buffer.
Figure 6: Evidence for domains, and per-domain characteristics (a–d, left–right). All y-axis values are averages. Discussion in §4.2.
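The two relations used throughout this section—average latency from occupancy and arrival rate via Little's law (L = O/R), and the per-domain throughput bound T ≤ (C × 64)/L—can be made concrete with a short sketch. The credit counts and unloaded latencies below are the approximate values reported above; they are illustrative rather than exact hardware parameters.

```python
def little_latency_ns(avg_occupancy, requests_per_sec):
    """Average latency via Little's law: L = O / R."""
    return avg_occupancy / (requests_per_sec / 1e9)

def domain_throughput_gbps(credits, latency_ns, cacheline_bytes=64):
    """Per-domain throughput bound: T <= C * 64 / L (bytes per ns == GB/s)."""
    return credits * cacheline_bytes / latency_ns

# e.g., an average queue occupancy of 14 entries at 200M requests/s -> 70 ns
print(little_latency_ns(14, 200e6))

# C2M-Read: ~10-12 LFB credits at ~70 ns unloaded latency -> ~11 GB/s per core
print(domain_throughput_gbps(12, 70))
# P2M-Write: ~92 IIO write-buffer credits at ~300 ns -> ~19.6 GB/s, above the PCIe
# line rate, which is why this domain has spare credits under ~14 GB/s of P2M traffic
print(domain_throughput_gbps(92, 300))
```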

P2M-Read. The P2M-Read domain spans all hops from IIO to DRAM because PCIe reads are non-posted transactions [54]—the IIO needs to wait until reads are serviced from DRAM and the data is returned before issuing the PCIe completion and replenishing the credits. We were not able to measure IIO read buffer occupancy on our server (and consequently IIO read latency), thus precluding us from determining the precise number of credits and the unloaded latency of the P2M-Read domain. However, we obtain a lower bound on the P2M-Read domain credits using measurements at the CHA (since the number of in-flight P2M-Read requests at the CHA cannot exceed the P2M-Read domain credits). By introducing C2M load colocated with P2M-Read traffic, we found that the number of in-flight P2M-Read requests at the CHA saturates at ∼164 cachelines, providing evidence that the P2M-Read domain has a larger number of credits than the P2M-Write domain.

5 UNDERSTANDING CONTENTION WITHIN THE HOST NETWORK

We now provide an in-depth explanation for the two regimes observed in §2 using the lens of domain-by-domain credit-based flow control. We use the same measurement methodology as in §4.2. We observe no statistically noticeable change in application performance when counter sampling is enabled. We disable dynamic scaling of core frequency to avoid variation in measurements (we observe less than 1.5% variation across all runs for all counters).

5.1 Understanding the blue regime

We first focus on quadrant 1 (C2M-Read, P2M-Write) since it captures most of the takeaways in terms of explaining the blue regime. For quadrant 1, we first explain why C2M throughput degrades and then why P2M throughput does not degrade.

C2M throughput degrades because domain credits are fully utilized and domain latency increases. Even when the C2M workload is run in isolation, the corresponding domain credits are fully utilized. This is because each core can issue instructions fast enough to keep the LFB full (e.g., a core with 3GHz frequency can issue instructions every 0.3ns, which is more than two orders of magnitude smaller than the minimum C2M-Read domain latency). As a result, any non-zero increase in domain latency will result in throughput degradation. Indeed, when the P2M workload is colocated, we observe a 1.26−1.8× increase in domain latency (Figure 7(a)) due to queueing at the MC (Figure 7(b)), since DRAM is part of the C2M-Read domain. Interestingly, such queueing happens far before memory bandwidth is saturated; we now explain this phenomenon.

Queueing at the MC before memory bandwidth saturation happens due to a combination of two DRAM-level factors: (1) row misses and (2) load imbalance across banks. Row misses result in processing delays at the banks (due to precharge/activate operations), which must be completed before the data can be accessed and transmitted over the memory channel. Even for a workload with 100% row miss ratio, the bank-level processing delays can still be hidden behind data transmission over the memory channel if requests are load balanced perfectly across the banks. If requests are perfectly distributed across N_b banks, then the bank processing delay can be hidden/overlapped behind transmission on the channel if t_Proc / N_b < t_Trans, where t_Proc is the per-request bank processing delay and t_Trans is the per-request transmission delay over the memory channel. This condition holds for the DRAM modules in our setup, where t_Proc ≈ 45ns, N_b = 32, and t_Trans = 2.73ns. In reality, however, load balancing is far from perfect since memory addresses are mapped to banks through a static hash function [56], which does not guarantee perfect load balancing [71]. As a result, requests can be blocked on bank processing even when the memory channel is idle, thus causing queueing even when channel capacity (i.e., memory bandwidth) is not saturated. We quantify row misses and load imbalance next, focusing on the single-core C2M case in quadrant 1.

In the absence of P2M traffic, the row miss ratio for C2M-Read requests is very low (<4%, shown in Figure 7(c)). This is because of the sequential access pattern resulting in good row locality. Colocating the P2M workload causes a significant increase in row miss ratio (up to 4×). This is because the C2M and P2M workloads access different address spaces—intermixing them reduces row locality, leading to a higher row miss ratio. While the row miss ratio also increases for C2M-only traffic from multiple cores, since each core accesses a different address space, colocating P2M traffic leads to a larger increase in row miss ratio, as is evident in Figure 7(c).

To measure load distribution, we sample the number of read requests mapped to each individual bank every 1000 requests³. Let the bank deviation of a given sample be the ratio of the load of the maximally loaded bank to the average load across banks. Figure 7(d) shows the CDF of bank deviation across 10000 samples. We see significant load imbalance, both with and without P2M traffic—the bank deviation is ≥1.5× in 50−70% of samples and ≥2× in as many as 13−22% of samples (although there is load imbalance even when C2M is run in isolation, it is not a problem since the row miss ratio is very low, causing bank processing delays to be negligible).

³For these measurements, we use a dedicated core that busy-polls on MC hardware counters [29]. Given constraints on the number of available hardware counters, we focus on 4 banks within a single DRAM module.
Figure 7: Results for understanding quadrant 1 (a–d, top, left–right; e–g, bottom, left–right). All y-axis values are averages. Discussion in §5.1.
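A small sketch of the two DRAM-level quantities discussed above—the bank-overlap condition and the bank-deviation metric—may be useful. The timing values are the approximate numbers reported for our setup; the request sample is illustrative.

```python
# Sketch: (1) whether bank processing can hide behind channel transfers under
# perfect load balancing (t_proc / N_b < t_trans), and (2) the per-sample
# bank-deviation metric (max bank load / average bank load).
from collections import Counter

t_proc_ns, n_banks, t_trans_ns = 45.0, 32, 2.73
print(t_proc_ns / n_banks < t_trans_ns)   # True: perfectly balanced load hides bank delays

def bank_deviation(bank_ids):
    counts = Counter(bank_ids)
    return max(counts.values()) / (len(bank_ids) / len(counts))

# e.g., 1000 requests spread unevenly over 4 monitored banks -> 1.6x deviation
sample = [0] * 400 + [1] * 250 + [2] * 200 + [3] * 150
print(bank_deviation(sample))
```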

Returning to our main discussion, P2M traffic observes different behavior compared to C2M traffic in quadrant 1 due to differences in: (1) domain latency and (2) domain credits.

The P2M-Write domain does not include the execution latency of DRAM, leading to smaller latency inflation relative to reads. Unlike the C2M-Read domain, where queueing at the MC leads to domain latency inflation, the P2M-Write domain does not include traversing DRAM, so its domain latency only increases when the MC write queue becomes full. As shown in Figure 7(f), the fraction of time the WPQ is filled is near zero when P2M-Write is colocated with a single C2M-Read core. Thus, there is no domain latency inflation for P2M-Write (Figure 7(e)). With increasing C2M cores, while we see a small increase (<25ns) in domain latency for P2M-Write (Figure 7(e)), it is smaller than what is seen for C2M-Read. This is because the WPQ starts to get filled up occasionally (<30% of the time), as shown in Figure 7(f).

The P2M domain can tolerate latency inflation due to the availability of spare domain credits. Increased P2M-Write domain latency, however, does not lead to a reduction in P2M throughput. This is because, unlike C2M-Read, when the P2M workload is run in isolation, domain credits are not fully utilized (as described in §4, the unloaded P2M-Write domain latency is ∼300ns; therefore, to saturate the PCIe bandwidth of ∼14GB/s, ∼65 credits are needed, which is smaller than the ∼92 available credits). P2M throughput degrades only after domain credits are exhausted. Indeed, we see a slight increase in domain credit utilization with increasing C2M cores, but it is well below the maximum limit (Figure 7(g)). Therefore, the P2M-Write domain is able to maintain enough in-flight write requests to mask the latency inflation and thus avoid throughput degradation.

Our explanation of quadrant 1 generalizes to quadrants 2 and 4 (the corresponding measurements are presented in [68]).

5.2 Understanding the red regime

We now turn our attention to quadrant 3, where both C2M and P2M observe throughput degradation. Before diving deep, we first briefly revisit the observations in quadrant 3 that bear similarities to §5.1.

With 2 or fewer C2M cores, similar to quadrant 1, C2M throughput degrades even before memory bandwidth is saturated (Figure 3). We see a similar trend of an increase in row miss ratio when P2M is colocated with C2M (Figure 8(c)); this, in combination with load imbalance across banks, results in queueing at the MC before memory bandwidth is saturated (Figure 8(b)). While there is a small (∼20−30ns) inflation in P2M-Write domain latency (Figure 8(d)), akin to quadrant 1, there is no P2M throughput degradation due to spare domain credits (Figure 8(f)).

With 3 or more C2M cores, when memory bandwidth becomes saturated, we observe two new trends in this quadrant (§2.2). First, from 3 to 4 C2M cores, we observe C2M antagonize P2M (i.e., a reduction in C2M throughput degradation coupled with a large increase in P2M throughput degradation). Second, beyond 4 C2M cores, the rate of degradation of P2M throughput reduces with increasing C2M cores. We now discuss the underlying reasons for both observations.

Backpressure from the MC impacts the P2M-Write domain but not the C2M-Write domain, leading to throughput degradation for the former. When memory bandwidth is saturated, unlike in quadrant 1 (Figure 7(f)), the MC write queue gets filled up persistently (75% of the time with 3 C2M cores, and nearly all the time with 4 or more C2M cores; Figure 8(e)), thus leading to backpressure and causing a backlog of writes at the CHA. Interestingly, this only impacts the P2M workload, but not the C2M workload, even though both are performing writes. This is because the P2M-Write domain spans the MC while the C2M-Write domain does not (§4); therefore, backpressure from the MC results in domain latency inflation for the P2M-Write domain but not the C2M-Write domain. From 3 to 4 C2M cores, due to the backlogging of writes, the P2M-Write domain latency increases by 1.5× (Figure 8(d)) and results in significant P2M throughput degradation since the domain credits are fully utilized (Figure 8(f)). The C2M workload (C2M-ReadWrite), however, is only bound by the C2M-Read domain latency (as the C2M-Write domain latency does not increase), which only increases by ∼12% (Figure 8(a)), since reads are not impacted by write backlogging—they can be processed concurrently at the CHA even when writes are blocked. As a result, C2M's share of memory bandwidth increases while P2M's share reduces.

Backpressure from the CHA impacts both C2M and P2M domains, leading to increased degradation for both. As the write backlog at the CHA continues to increase with increasing C2M load, we observe a new phenomenon that is evident beyond 4 C2M cores: the CHA begins to apply backpressure due to limited buffering resources.
Figure 8: Results for understanding quadrant 3 (a–d, top, left–right; e–f, bottom, left–right). All y-axis values are averages. Discussion in §5.2.

The trend in Figure 8(b) provides evidence of this—the average RPQ occupancy saturates beyond 4 C2M cores (and is lower than the corresponding without-P2M values), showing that the number of in-flight read requests from the CHA to the MC has been capped despite the total number of in-flight read requests increasing with more C2M cores. This indicates that some requests are getting blocked at the cores even before being admitted into the CHA—a result of backpressure from the CHA. Under CHA backpressure, write backlogging no longer has a one-sided impact on the P2M domain. Latency inflation is now primarily determined by delay in admitting requests into the CHA itself, which impacts both C2M and P2M domains. We see a roughly equitable increase in domain latency (∼50ns) when going from 5 to 6 cores for both the C2M and P2M domains, thus leading to a relative stabilization of their memory bandwidth shares (Figure 3).

6 QUANTITATIVE VALIDATION

In §5, we identified the root causes for performance degradation due to interplay between processor, memory, and peripheral interconnects, based on correlations with measurements from nodes in the host network. This, however, does not imply that these are the only factors impacting performance degradation. To close this gap and further validate our understanding, we now connect these measurements to the observed end-to-end throughput degradation. To this end, we develop an analytical formula that captures the average memory access latency observed by C2M/P2M traffic (which then directly connects to throughput using Little's law). Our analytical formula focuses on queueing delay at the MC (for reads) and at the CHA (for writes). While, in theory, there can be queueing delay at other points in the host network (e.g., in cache controllers, within the processor interconnect, etc.), we demonstrate that queueing delay at these other points contributes minimally to end-to-end latency (our analytical analysis captures end-to-end performance to a high degree of accuracy across all evaluated workloads). We first describe our analytical formula (§6.1), following which, we present results of applying it to the four quadrants (§6.2).

6.1 Analytical Formula

In designing our analytical formula, we exploit the insight that since we are analyzing latency for the purpose of understanding average throughput, we can focus on average-case behavior across a large number of memory requests rather than on individual request dynamics, which are difficult to capture. Before describing the analytical formula, we highlight that it is not designed to be perfect—it does not capture all the intricacies of DRAM operation, including out-of-order request scheduling, resource contention at some levels of the DRAM hierarchy (e.g., ranks and bank groups), low-level hardware optimizations (e.g., opportunistic processing of memory writes by Intel memory controllers while in read mode [59]), and a subset of DRAM timing constraints (e.g., write recovery delays, rank-level timing constraints, etc.). Despite its relative simplicity, as we will demonstrate, it still captures latency inflation to a reasonable degree of accuracy (within ∼10% error) in most evaluated scenarios. We first build the analytical expression for read domain latency, following which we discuss write domain latency.

Read Domain Latency. Our formula for read domain latency (Figure 9) is applicable to both the C2M-Read and P2M-Read domains. As motivated at the start of the section, we focus only on the average queueing delay for reads (QD_read) at the MC. We, therefore, abstract away all latency in the end-to-end datapath other than QD_read into a constant (Constant_read). Naturally, Constant_read is different for the C2M-Read and P2M-Read domains since they have non-shared hops in their datapaths (§4). QD_read is expressed as a sum of four additive components. The first three components capture the average delay for a given read request to reach the top of the RPQ (thus, they are all a function of the average RPQ occupancy O_RPQ), and the last component captures the additional delay that a request incurs after reaching the top of the queue before it is issued to DRAM. We now describe each of the individual formula components:

• Switching Delay: This component captures the average time a given read is blocked due to write-to-read switching delay (t_WTR). Since we focus on average case behavior, we can compute the total switching cost over a large number of switches (#switches) and average it over a large number of reads (lines_read).

• Write Head-of-Line Blocking: This component captures the average time for which a read is blocked because the channel is currently in write mode and cannot issue reads. Average case analysis allows us to compute the total time spent in write mode over a long time window (lines_written × t_Trans) and average it across a large number of reads (lines_read).
• Read Head-of-Line Blocking: A given read request has to wait for the requests before it (O_RPQ − 1 on average) in the RPQ to get transmitted on the memory channel. In reality, while the MC may schedule requests out-of-order to maximize utilization, our evaluation of the formula indicates that this has very little impact, if any, on the end-to-end latency for the workloads we focus on.

• Top-of-queue delay: Even after reaching the top of the RPQ, a request might still have to wait for activate/precharge operations to complete before it is issued to DRAM. To capture this, we compute the total cost of activate and precharge operations (#ACT_read × t_ACT and #PRE^conflict_read × t_PRE) and average them across a large window of requests (lines_read).

Analytical formula inputs:
  P^WPQ_fill — probability that the WPQ is full
  N_waiting — # write requests awaiting WPQ admission
  #switches — # switches between read and write mode
  lines_read/write — # cachelines read / written
  O_RPQ — average RPQ occupancy
  #PRE^conflict_read/write — # precharges due to row conflicts for reads / writes
  #ACT_read/write — # activations for reads / writes
Table 2: Inputs to the formula for computing latencies for C2M and P2M read/write domains (discussion in §6.1).

  L_m^read = Constant_read + QD_read                                                        (Average read latency)
  QD_read = O_RPQ · (#switches / lines_read) · t_WTR                                        (Switching Delay)
          + O_RPQ · (lines_written / lines_read) · t_Trans                                  (Write HoL blocking)
          + (O_RPQ − 1) · t_Trans                                                           (Read HoL blocking)
          + (#ACT_read / lines_read) · t_ACT + (#PRE^conflict_read / lines_read) · t_PRE    (Top-of-queue delay)
Figure 9: Read domain latency components (inputs defined in Table 2; t_WTR, t_Trans (transmission delay: time taken to transmit a single cacheline over the memory channel in either direction), t_ACT (t_RCD) and t_PRE (t_RP) are standard DRAM timing constraints).

Write Domain Latency. Writes require slightly different analysis than reads, since writes do not have to wait until they are actually issued and processed in DRAM. For the P2M-Write domain, they are completed as soon as they are admitted into the MC WPQ. Thus, P2M-Write domain latency only inflates when the WPQ is filled, at which point requests will have to wait for some time until they are admitted (admission delay, AD_write).

Our write domain latency formula (Figure 10) captures AD_write via (1) the probability that a request is blocked due to the WPQ being full (P^WPQ_fill), and (2) the average waiting time for a request when the WPQ is full (X_write). For X_write, we use an expression analogous to read queueing delay, with the parameters for reads/writes swapped (since the corresponding components for write processing are exactly the duals of those for reads), and using N_waiting, the average number of writes (both C2M and P2M) waiting to be admitted into the queue, instead of O_RPQ (since admitting N_waiting requests requires processing an equal number of writes to make space in the queue). Unlike P2M writes, C2M writes do not have to wait until they are admitted to the MC. We do not capture inflation of C2M-Write domain latency and assume it to be a constant. We later discuss the implications of doing so.

  L_m^write = Constant_write + AD_write                                                     (Average write latency)
  AD_write = P^WPQ_fill · X_write
  X_write = N_waiting · (#switches / lines_written) · t_RTW                                 (Switching Delay)
          + N_waiting · (lines_read / lines_written) · t_Trans                              (Read HoL blocking)
          + (N_waiting − 1) · t_Trans                                                       (Write HoL blocking)
          + (#ACT_write / lines_written) · t_ACT + (#PRE^conflict_write / lines_written) · t_PRE   (Top-of-queue delay)
Figure 10: Write domain latency components (inputs defined in Table 2; t_RTW (read-to-write switching delay), t_Trans (transmission delay: time taken to transmit a single cacheline over the memory channel in either direction), t_ACT (t_RCD) and t_PRE (t_RP) are standard DRAM timing constraints).

[Figure 11: Accuracy of the analytical formulae: (top) error in the formula's estimate of C2M throughput for quadrants 1, 2, and 4; (bottom) error in the formula's estimate of C2M and P2M throughput for quadrant 3, both with and without adding CHA admission delay. Error (%) is plotted against the number of C2M cores; positive values indicate overestimation, and negative values indicate underestimation.]
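To make the two expressions concrete, the following is a minimal, illustrative Python sketch of how the formulae in Figures 9 and 10 could be evaluated. It is not the paper's artifact: parameter names simply mirror Table 2, and the DRAM timing constants (t_WTR, t_RTW, t_Trans, t_ACT, t_PRE) are platform-specific values that would need to be taken from the DRAM/controller specifications rather than from this paper.

# Illustrative sketch of the read/write domain latency formulae (Figures 9 and 10).
# Inputs follow Table 2; timing constants (in ns) are placeholders, not measured values.

def read_domain_latency(constant_read, o_rpq, n_switches, lines_read, lines_written,
                        n_act_read, n_pre_conflict_read, t_wtr, t_trans, t_act, t_pre):
    qd_read = (o_rpq * (n_switches / lines_read) * t_wtr             # Switching delay
               + o_rpq * (lines_written / lines_read) * t_trans      # Write HoL blocking
               + (o_rpq - 1) * t_trans                               # Read HoL blocking
               + (n_act_read / lines_read) * t_act
               + (n_pre_conflict_read / lines_read) * t_pre)         # Top-of-queue delay
    return constant_read + qd_read

def write_domain_latency(constant_write, p_wpq_fill, n_waiting, n_switches, lines_read,
                         lines_written, n_act_write, n_pre_conflict_write,
                         t_rtw, t_trans, t_act, t_pre):
    x_write = (n_waiting * (n_switches / lines_written) * t_rtw      # Switching delay
               + n_waiting * (lines_read / lines_written) * t_trans  # Read HoL blocking
               + (n_waiting - 1) * t_trans                           # Write HoL blocking
               + (n_act_write / lines_written) * t_act
               + (n_pre_conflict_write / lines_written) * t_pre)     # Top-of-queue delay
    return constant_write + p_wpq_fill * x_write                     # AD_write = P_fill * X_write

Evaluating each additive term separately is also what yields the per-component breakdowns discussed later (Figure 12).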
6.2 Applying the Formula

All formula inputs can be captured or derived using programmable uncore performance counters available on Intel servers [29], using the same measurement methodology as §5. We use MC counters to capture all the inputs except N_waiting. For N_waiting, we use counters from the CHA, since this is where requests are backlogged when the MC write queues are full [29].

Applying the formula. We set Constant_read/write based on unloaded latencies of domains (§4.2). For any given experiment, we then apply the formula using the measured inputs to obtain the average domain latency. Depending on the workload, we use either the read or the write domain latency expression. For C2M-Read and P2M-Read, we use the read domain latency expression. For P2M-Write, we use the write latency expression. For C2M-ReadWrite, we use the C2M-Read domain latency plus a constant (to account for C2M-Write). After obtaining average domain latency (L), we compute estimated throughput using the expression in §4.
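The conversion from an average domain latency L to an estimated throughput uses the expression from §4, which is not reproduced in this section. As a rough illustration only, the sketch below assumes the Little's-law form implied by the discussion at the start of §6 (in-flight cache lines × 64 B divided by latency); the exact expression in §4 may differ.

# Rough illustration of converting an estimated domain latency into throughput.
# Assumption (not the exact expression from Section 4): throughput follows
# Little's law, i.e., in-flight bytes divided by average domain latency.

CACHELINE_BYTES = 64

def estimated_throughput_gbytes_per_s(in_flight_cachelines, domain_latency_ns):
    return in_flight_cachelines * CACHELINE_BYTES / domain_latency_ns  # bytes/ns == GB/s

# With the Section 5.1 numbers: ~92 in-flight cache lines at ~300ns of latency would
# support ~19.6 GB/s, comfortably above the ~14GB/s PCIe bandwidth.
print(estimated_throughput_gbytes_per_s(92, 300))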
[Figure 12: Breakdown of analytical formula components (a-e, left-right): (a, b, c) breakdown of formula components for C2M in quadrants 1, 2, and 4, respectively; (d, e) breakdown of formula components for C2M and P2M, along with CHA admission delay, in quadrant 3. Each panel shows the Switching, Write HoL, Read HoL, Top-of-queue, and (for quadrant 3) CHA Admission Delay contributions to C2M/P2M latency inflation (ns), plotted against the number of C2M cores.]
Formula accurately captures end-to-end throughput. Figure 11 shows the error in the formula's estimate of throughput in all of the quadrants from §2.2. The formula captures throughput for the C2M workload (which is what degrades) within 10% error for all data points in quadrants 1, 2, and 4. For quadrant 3, with 4 or fewer cores, the error for both C2M/P2M is within 10%. However, beyond 4 C2M cores, the error increases (especially for C2M throughput). This is because our formula currently does not account for admission delay to the CHA, which manifests when CHA buffers get filled up (as is the case for quadrant 3 with 4 or more C2M cores) and causes latency inflation for both C2M and P2M domains (§5.2)—as a result, the formula underestimates latency, leading to overestimation of throughput. When we add the measured CHA admission delay from the testbed to the output of the formula, the error reduces to < 10% for all data points, thus validating our understanding.

Breakdown of formula components. Figure 12 presents the breakdown of queueing delay into each of the individual formula components for all of the quadrants. For quadrant 1, with a single C2M core, Write HoL is the dominant contributor. With increasing C2M cores, both Write HoL and Read HoL increase. Quadrant 2 has a larger Read HoL component due to higher average read queue occupancy, but it has no Write HoL component because there are no writes. In quadrant 4, Read HoL is the dominant contributor for all data points. In quadrant 3, for C2M, Write HoL is the dominant contributor up to 4 C2M cores, beyond which CHA admission delay starts to dominate. For P2M, Write HoL is the dominant factor until 3 C2M cores, after which Read HoL becomes dominant.

7 DISCUSSION AND FUTURE DIRECTIONS

Technology trends for the host hardware suggest that performance of peripheral interconnects is improving much more rapidly than processor and memory interconnects. This has led to an increasing imbalance of resources and contention within the host network, which, in turn, negatively impacts application-level performance. We have presented a conceptual abstraction of domain-by-domain credit-based flow control that precisely captures the interplay between processor, memory, and peripheral interconnects within the host network. Using this abstraction, we have built an in-depth understanding of contention within the host network and its impact on application performance reported by previous studies, as well as identified new, previously unreported, regimes of contention within the host network. Our study opens up several interesting avenues of future research at the intersection of computer networking, operating systems, and computer architecture. We outline some of these below.

Building an even deeper understanding of the host network. For instance, we focus on a simple setup: two generations of Intel processors with C2M and P2M apps contending on host network resources within the same socket, peripheral devices connected to a single IIO, and all peripheral transfers executed with DDIO disabled. A natural next step is to extend our study to hosts with multiple sockets, multiple IIOs, modern direct cache access mechanisms, and with a wider variety of Intel and AMD processors. Looking forward, the host network in modern datacenter hosts is becoming increasingly complex with new interconnects such as CXL and NVLink, deeper topologies with PCIe switches, different kinds of memory such as High-Bandwidth Memory, and new kinds of hardware accelerators and data movement engines. More work needs to be done to understand the behavior of the host network for such hosts.

Our analytical formula in §6 quantitatively validates our analysis. There are several possible extensions. First, our analytical formula requires inputs that are measured; it would be interesting to build an analytical model that can predict performance given a particular host network hardware configuration (e.g., by extending [5]). Second, while we precisely capture the impact of contention within the host network on application performance for our storage and RoCE/PFC experiments, such is not the case for the lossy network setup—here, packet drops and the resulting congestion response lead to complex modeling issues. Incorporating the behavior of end-to-end datacenter-level transport protocols within the analytical model is an important extension of our work. Finally, it would be interesting to build host network simulators that enable a deeper exploration of domain-by-domain credit-based flow control and the host network in general.

Rearchitecting protocols, operating systems, and host hardware. The host network has implications that are more far-reaching than the context of networked applications and datacenter congestion control—it impacts application-level performance even when all applications are contained within a single host. Our study thus opens up many interesting avenues of future research in the design of protocols, operating systems, and even host hardware along several directions. For instance, it would be interesting to explore new mechanisms for host network resource allocation (e.g., extending ideas in hostCC [2] to the case of all traffic contained within a single host), new memory controller scheduling mechanisms to better isolate C2M/P2M traffic (e.g., extending ideas in heterogeneous memory scheduling architectures [6, 33, 34]), and new datapaths for peripheral traffic (e.g., using dynamic direct cache access [3, 69], or even bypassing memory read/writes altogether [27, 65]).

ACKNOWLEDGEMENTS

We would like to thank our shepherd, Michio Honda, the SIGCOMM reviewers, Qizhe Cai, and Shreyas Kharbanda for their insightful feedback. This research was in part supported by NSF grants CNS-2047283 and CNS-2212193, a Sloan fellowship, gifts from Intel and Google, and by ACE, one of the seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. This work does not raise any ethical concerns.
REFERENCES
[1] Saksham Agarwal, Rachit Agarwal, Behnam Montazeri, Masoud Moshref, Khaled Elmeleegy, Luigi Rizzo, Marc Asher de Kruijf, Gautam Kumar, Sylvia Ratnasamy, David Culler, and Amin Vahdat. 2022. Understanding Host Interconnect Congestion. In ACM HotNets.
[2] Saksham Agarwal, Arvind Krishnamurthy, and Rachit Agarwal. 2023. Host Congestion Control. In ACM SIGCOMM.
[3] Mohammad Alian, Siddharth Agarwal, Jongmin Shin, Neel Patel, Yifan Yuan, Daehoon Kim, Ren Wang, and Nam Sung Kim. 2022. IDIO: Network-Driven, Inbound Network Data Orchestration on Server Processors. In IEEE MICRO.
[4] Anthony Alles. 1995. ATM Internetworking. In Engineering InterOp.
[5] Mina Tahmasbi Arashloo, Ryan Beckett, and Rachit Agarwal. 2023. Formal Methods for Network Performance Analysis. In USENIX NSDI.
[6] Rachata Ausavarungnirun, Kevin Kai-Wei Chang, Lavanya Subramanian, Gabriel H Loh, and Onur Mutlu. 2012. Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems. In ACM SIGARCH Computer Architecture News.
[7] Jens Axboe. 2024. axboe/fio: Flexible I/O Tester. https://github.com/axboe/fio.
[8] Scott Beamer, Krste Asanovic, and David A. Patterson. 2015. The GAP Benchmark Suite. http://arxiv.org/abs/1508.03619.
[9] Qizhe Cai, Mina Tahmasbi Arashloo, and Rachit Agarwal. 2022. dcPIM: Near-Optimal Proactive Datacenter Transport. In ACM SIGCOMM.
[10] Qizhe Cai, Shubham Chaudhary, Midhul Vuppalapati, Jaehyun Hwang, and Rachit Agarwal. 2021. Understanding Host Network Stack Overheads. In ACM SIGCOMM.
[11] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and Batch Processing in a Single Engine. In IEEE Data Engineering Bulletin.
[12] Justin Castilla. 2024. Clustering In Redis. https://developer.redis.com/operate/redis-at-scale/scalability/lustering-in-redis/.
[13] Robert Cole, David Shur, and Curtis Villamizar. 1996. IP Over ATM: A Framework Document. https://datatracker.ietf.org/doc/html/rfc1932.
[14] Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. In Communications of the ACM.
[15] Travis Downs. 2018. It's not write combining. https://github.com/Kobzol/hardware-effects/issues/1.
[16] Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N Patt. 2010. Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-core Memory Systems. In ACM SIGPLAN Notices.
[17] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. 2012. SAP HANA Database: Data Management for Modern Business Applications. In ACM SIGMOD Record.
[18] Alireza Farshin, Amir Roozbeh, Gerald Q Maguire Jr, and Dejan Kostic. 2020. Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-Hundred-Gigabit Networks. In USENIX ATC.
[19] Cache Forge. 2024. memcached - A Distributed Memory Object Caching System. https://memcached.org/.
[20] Henry J Fowler. 1995. TMN-Based Broadband ATM Network Management. In IEEE Communications Magazine.
[21] Alex Galis, Dieter Gantenbein, Stefan Covaci, Carlo Bianza, Fotis Karayannis, and George Mykoniatis. 1996. Toward Multidomain Integrated Network Management for ATM and SDH Networks. In Broadband Strategies and Technologies for Wide Area and Local Access Networks.
[22] Peter X Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2015. pHost: Distributed Near-Optimal Datacenter Transport Over Commodity Network Fabric. In ACM CoNEXT.
[23] Saugata Ghose, Hyodong Lee, and José F Martínez. 2013. Improving Memory Scheduling via Processor-Side Load Criticality Information. In ACM/IEEE ISCA.
[24] Joseph E Gonzalez, Reynold S Xin, Ankur Dave, Daniel Crankshaw, Michael J Franklin, and Ion Stoica. 2014. GraphX: Graph Processing in a Distributed Dataflow Framework. In USENIX OSDI.
[25] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W Moore, Gianni Antichi, and Marcin Wójcik. 2017. Re-Architecting Datacenter Networks and Stacks for Low Latency and High Performance. In ACM SIGCOMM.
[26] Shuihai Hu, Wei Bai, Gaoxiong Zeng, Zilong Wang, Baochen Qiao, Kai Chen, Kun Tan, and Yi Wang. 2020. Aeolus: A Building Block for Proactive Transport in Datacenters. In ACM SIGCOMM.
[27] Stephen Ibanez, Alex Mallery, Serhat Arslan, Theo Jepsen, Muhammad Shahbaz, Changhoon Kim, and Nick McKeown. 2021. The NanoPU: A Nanosecond Network Stack for Datacenters. In USENIX OSDI.
[28] Intel. 2012. Intel Data Direct I/O Technology (Intel DDIO): A Primer. https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/data-direct-i-o-technology-brief.pdf.
[29] Intel. 2017. Intel Xeon Processor Scalable Memory Family Uncore Performance Monitoring. https://kib.kiev.ua/x86docs/Intel/PerfMon/336274-001.pdf.
[30] Intel. 2023. Intel 64 and IA-32 Architectures Software Developer's Manual. https://cdrdv2.intel.com/v1/dl/getContent/671200.
[31] Ravi Iyer, Li Zhao, Fei Guo, Ramesh Illikkal, Srihari Makineni, Don Newell, Yan Solihin, Lisa Hsu, and Steve Reinhardt. 2007. QoS Policies and Architecture for Cache/Memory in CMP Platforms. In ACM SIGMETRICS Performance Evaluation Review.
[32] Raj Jain. 1996. Congestion Control and Traffic Management in ATM Networks: Recent Advances and a Survey. In Computer Networks and ISDN Systems.
[33] Min Kyu Jeong, Mattan Erez, Chander Sudanthi, and Nigel Paver. 2012. A QoS-Aware Memory Controller for Dynamically Balancing GPU and CPU Bandwidth Use in an MPSoC. In ACM/IEEE DAC.
[34] Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R Das. 2014. Managing GPU Concurrency in Heterogeneous Architectures. In IEEE MICRO.
[35] Yoongu Kim. 2015. Architectural Techniques to Enhance DRAM Scaling. https://kilthub.cmu.edu/articles/thesis/Architectural_Techniques_to_Enhance_DRAM_Scaling/7461695/1.
[36] Yoongu Kim, Dongsu Han, Onur Mutlu, and Mor Harchol-Balter. 2010. ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers. In IEEE HPCA.
[37] Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter. 2010. Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. In IEEE MICRO.
[38] HT Kung and Alan Chapman. 1993. The FCVC (Flow-Controlled Virtual Channels) Proposal for ATM Networks: A Summary. In IEEE ICNP.
[39] HT Kung and Robert Morris. 1995. Credit-Based Flow Control for ATM Networks. In IEEE Network.
[40] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-Scale Graph Computation on Just a PC. In USENIX OSDI.
[41] Chang Joo Lee, Veynu Narasiman, Eiman Ebrahimi, Onur Mutlu, and Yale N. Patt. 2010. DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems. https://utw10235.utweb.utexas.edu/people/cjlee/TR-HPS-2010-002.pdf.
[42] Qiang Li, Qiao Xiang, Yuxin Wang, Haohao Song, Ridi Wen, Wenhui Yao, Yuanyuan Dong, Shuqi Zhao, Shuo Huang, Zhaosheng Zhu, Huayong Wang, Shanyang Liu, Lulu Chen, Zhiwu Wu, Haonan Qiu, Derui Liu, Gexiao Tian, Chao Han, Shaozong Liu, Yaohui Wu, Zicheng Luo, Yuchao Shao, Junping Wu, Zheng Cao, Zhongjie Wu, Jiaji Zhu, Jinbo Wu, Jiwu Shu, and Jiesheng Wu. 2023. More Than Capacity: Performance-oriented Evolution of Pangu in Alibaba. In USENIX FAST.
[43] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, et al. 2019. HPCC: High Precision Congestion Control. In ACM SIGCOMM.
[44] Kefei Liu, Zhuo Jiang, Jiao Zhang, Haoran Wei, Xiaolong Zhong, Lizhuang Tan, Tian Pan, and Tao Huang. 2023. Hostping: Diagnosing Intra-Host Network Bottlenecks in RDMA Servers. In USENIX NSDI.
[45] Grzegorz Malewicz, Matthew H Austern, Aart JC Bik, James C Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A System for Large-Scale Graph Processing. In ACM SIGMOD.
[46] John D McCalpin. 1995. STREAM: Sustainable Memory Bandwidth in High Performance Computers. https://www.cs.virginia.edu/stream/.
[47] David J Miller, Philip M Watts, and Andrew W Moore. 2009. Motivating Future Interconnects: A Differential Measurement Analysis of PCIe Latency. In ACM/IEEE ANCS.
[48] Thomas Moscibroda and Onur Mutlu. 2008. Distributed Order Scheduling and its Application to Multi-Core DRAM Controllers. In ACM PODC.
[49] Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda. 2011. Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning. In IEEE MICRO.
[50] Onur Mutlu and Thomas Moscibroda. 2007. Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. In IEEE MICRO.
[51] Onur Mutlu and Thomas Moscibroda. 2008. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ACM/IEEE ISCA.
[52] Thomas Moscibroda and Onur Mutlu. 2007. Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems. In USENIX Security.
[53] Kyle J Nesbit, Nidhi Aggarwal, James Laudon, and James E Smith. 2006. Fair Queuing Memory Systems. In IEEE MICRO.
[54] Rolf Neugebauer, Gianni Antichi, José Fernando Zazo, Yury Audzevich, Sergio López-Buedo, and Andrew W Moore. 2018. Understanding PCIe Performance for End Host Networking. In ACM SIGCOMM.
[55] George P Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan. 2012. On-Chip Networks From a Networking Perspective: Congestion and Scalability in Many-Core Interconnects. In ACM SIGCOMM.
[56] Peter Pessl, Daniel Gruss, Clémentine Maurice, Michael Schwarz, and Stefan Mangard. 2016. DRAMA: Exploiting DRAM Addressing for Cross-CPU Attacks. In USENIX Security.
[57] Redis. 2024. Redis. http://www.redis.io.
[58] Redis. 2024. Redis Benchmark. https://redis.io/docs/management/optimization/benchmarks/.
[59] Bryan Spry, Nagi Aboulenein, and Steve Kulick. 2008. United States Patent Application Publication: Mechanism for Write Optimization to a Memory Device. https://patentimages.storage.googleapis.com/53/bf/04/667faa6c4e5278/US20080162799A1.pdf.
[60] Michael Stonebraker and Ariel Weisberg. 2013. The VoltDB Main Memory DBMS. In Data Engineering Bulletin.
[61] Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, Ana Klimovic, Adrian Schuepbach, and Bernard Metzler. 2019. Unification of Temporary Storage in the NodeKernel Architecture. In USENIX ATC.
[62] Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, and Onur Mutlu. 2014. The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost. In IEEE ICCD.
[63] Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu. 2015. The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory. In IEEE MICRO.
[64] Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu. 2013. MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems. In IEEE HPCA.
[65] Mark Sutherland, Siddharth Gupta, Babak Falsafi, Virendra Marathe, Dionisios Pnevmatikatos, and Alexandros Daglis. 2020. The NEBULA RPC-Optimized Architecture. In ACM/IEEE ISCA.
[66] Amin Tootoonchian, Aurojit Panda, Chang Lan, Melvin Walls, Katerina Argyraki, Sylvia Ratnasamy, and Scott Shenker. 2018. ResQ: Enabling SLOs in Network Function Virtualization. In USENIX NSDI.
[67] James Tuck, Luis Ceze, and Josep Torrellas. 2006. Scalable Cache Miss Handling for High Memory-Level Parallelism. In IEEE MICRO.
[68] Midhul Vuppalapati, Saksham Agarwal, Henry Schuh, Baris Kasikci, Arvind Krishnamurthy, and Rachit Agarwal. 2024. Understanding The Host Network (Technical Report). https://github.com/host-architecture/understanding-the-host-network.
[69] Yifan Yuan, Jinghan Huang, Yan Sun, Tianchen Wang, Jacob Nelson, Dan RK Ports, Yipeng Wang, Ren Wang, Charlie Tai, and Nam Sung Kim. 2023. RAMBDA: RDMA-Driven Acceleration Framework for Memory-Intensive µs-scale Datacenter Applications. In IEEE HPCA.
[70] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster Computing with Working Sets. In USENIX HotCloud.
[71] Zhao Zhang, Zhichun Zhu, and Xiaodong Zhang. 2000. A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality. In IEEE MICRO.
[72] Mark Zhao, Niket Agarwal, Aarti Basant, Buğra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, and Parik Pol. 2022. Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training: Industrial Product. In ACM/IEEE ISCA.
[73] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. 2015. Congestion Control for Large-Scale RDMA Deployments. In ACM SIGCOMM.
