
NUMA optimization in Windows Applications 

Michael Wall, Senior Member of Technical Staff, AMD  1/16/2007 

Microprocessors have been reaching higher performance levels with each new design, and higher performance typically involves greater demands on the memory system. This effect is compounded by the new trend toward multi-core processor designs, where multiple execution threads are making memory requests. In multi-processor workstations and servers, the old single-bus system architecture does not scale up adequately to meet these growing memory performance requirements.

AMD has mitigated the problem by integrating the memory controller onto the processor. As more processors are added to a system, more memory controllers are also added. Memory bandwidth scales up naturally to feed the multiple processors. Software developers can maximize the performance benefits of this NUMA (Non-Uniform Memory Access) architecture by following a few simple rules of thumb.

This diagram illustrates the important concepts of NUMA. Each processor package is a NUMA node, which includes one or more cores. The cores within a NUMA node all share an on-chip memory controller. Memory attached to the local controller can be thought of as local memory, and the remainder of system memory is non-local memory. Every NUMA node can access all the memory in the system, but access to local memory is slightly faster. More importantly, if each node is making heavy use of local memory, all memory banks are well utilized and all memory controllers are doing useful work concurrently, reducing latencies and increasing overall performance. Also, the HyperTransport™ links are free to carry I/O traffic instead of shuttling lots of memory data between the NUMA nodes. NUMA performance is all about maximizing concurrency.

An operating system that supports NUMA will automatically try to allocate local memory for each process that is running within a single NUMA node. But what about multi-threaded processes that span multiple NUMA nodes? How can developers ensure that local memory is optimally used by these threads?

The main concept for developers in NUMA optimization is affinity. Think of affinity as a form of attachment. There are two kinds of affinity: memory affinity and thread affinity. Memory affinity means a certain range of memory addresses is physically mapped to the local memory bank of a particular NUMA node. Thread affinity means a particular thread will only run on a certain subset of cores, typically only those cores that reside within a single NUMA node. By explicitly using affinity, developers can ensure that each thread accesses data stored in its own node's local memory bank, thereby improving overall performance.

Windows provides APIs for detecting system NUMA topology and for setting thread affinity. Memory affinity can also be indirectly controlled using these APIs. This Visual Studio 2005 demo project shows how; please download it now and look at the code:

http://developer.amd.com/assets/numanuma.zip

The demo project shows three different ways to implement a simple, multi-threaded memory exerciser. It shows how to spawn threads and set thread affinity using the simple OpenMP method, and also using the Windows thread APIs. It also demonstrates how to allocate memory with controlled affinity in each case. Comments in the code explain briefly what it's doing, and the key concepts are described in more detail below.
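
To make the approach concrete, here is a rough sketch of the OpenMP flavor (not code from the demo project itself). It assumes one worker thread per NUMA node and uses the Windows topology and affinity calls described in the following paragraphs:

#include <windows.h>
#include <omp.h>

// Sketch: run one OpenMP thread per NUMA node, each pinned to its node's cores.
void RunNodePinnedWorkers()
{
    ULONG highestNode = 0;
    GetNumaHighestNodeNumber(&highestNode);          // nodes are numbered 0..highestNode

    #pragma omp parallel num_threads((int)(highestNode + 1))
    {
        UCHAR node = (UCHAR)omp_get_thread_num();    // map thread i to node i
        ULONGLONG nodeMask = 0;
        GetNumaNodeProcessorMask(node, &nodeMask);   // cores belonging to this node
        SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)nodeMask);

        // Allocate and initialize this thread's working set here, so its
        // pages end up in the node's local memory bank.
    }
}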

Windows uses the notion of an affinity mask for setting thread affinity. The mask is simply a value that is associated with the thread, and contains one bit per core in the system. If the bit is set, the corresponding core is allowed to execute the thread. If the bit is cleared, the corresponding core is excluded from running the thread. See MSDN for complete documentation on the SetThreadAffinityMask() function.
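
For instance, restricting the calling thread to the first two cores in the system could look like this (a minimal sketch, not code from the demo project):

#include <windows.h>

// Sketch: allow the calling thread to run only on logical processors 0 and 1.
// Bit i of the affinity mask corresponds to logical processor i.
void PinToFirstTwoCores()
{
    DWORD_PTR newMask = ((DWORD_PTR)1 << 0) | ((DWORD_PTR)1 << 1);
    DWORD_PTR previousMask = SetThreadAffinityMask(GetCurrentThread(), newMask);
    if (previousMask == 0)
    {
        // The call failed; for example, the mask selects no valid processor.
    }
}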

For example, a thread's affinity mask might set all the bits corresponding to a particular NUMA node's cores, enabling that thread to run anywhere within the NUMA node. GetNumaNodeProcessorMask() can be used to learn which cores are included in a given NUMA node; see MSDN for details. Also see GetNumaHighestNodeNumber() and GetNumaProcessorNode() for help in detecting total system topology. Windows Vista™ and Longhorn Server also support the function GetLogicalProcessorInformation(), which returns a wealth of system configuration information.
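
A minimal topology sketch using these calls (assuming Windows XP x64 / Server 2003 or later) simply walks the nodes and prints the processor mask of each:

#include <windows.h>
#include <stdio.h>

// Sketch: list each NUMA node and the cores it contains.
void PrintNumaTopology()
{
    ULONG highestNode = 0;
    if (!GetNumaHighestNodeNumber(&highestNode))
        return;                                      // NUMA APIs not available

    for (UCHAR node = 0; node <= (UCHAR)highestNode; ++node)
    {
        ULONGLONG mask = 0;
        if (GetNumaNodeProcessorMask(node, &mask))
            printf("Node %u: processor mask 0x%llx\n", (unsigned)node, mask);
    }
}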

Once a thread's affinity has been set, the thread can make calls to memory allocation functions (e.g. VirtualAlloc), and those calls will return memory blocks that reside in the thread's local NUMA memory bank, assuming memory space is available there. Windows Vista and Longhorn Server also support VirtualAllocExNuma(), which allocates memory directly on a particular NUMA node.
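
A sketch of both allocation styles (the node number and protection flags here are only placeholders) might look like this:

#include <windows.h>

// Sketch: obtain node-local memory two ways.
void* AllocateLocalBuffer(SIZE_T bytes, UCHAR preferredNode)
{
#if (_WIN32_WINNT >= 0x0600)
    // Vista / Longhorn Server: ask for a specific NUMA node explicitly.
    return VirtualAllocExNuma(GetCurrentProcess(), NULL, bytes,
                              MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                              preferredNode);
#else
    // Earlier systems: the allocation follows the calling thread's affinity,
    // so set the thread's affinity mask before calling this function.
    return VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#endif
}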

There is one subtle aspect of memory affinity in Windows®. Windows XP and Server 2003 use a "lazy commit" policy: each memory page is physically committed on the NUMA node that first accesses it, not when the memory block is allocated. In contrast, Windows Vista and Longhorn Server keep track of a thread's preferred node and commit the memory on that node when it is accessed, even if the thread happens to be running on a different node at the time.
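
The practical consequence on the lazy-commit systems is that the thread that will use the data should also be the first one to write it, after its affinity has been set. A minimal sketch of that pattern (the node mask and buffer size are supplied by the caller):

#include <windows.h>
#include <string.h>

// Sketch: make pages land on this thread's local node under the
// Windows XP / Server 2003 lazy-commit policy.
void* AllocateAndTouchLocally(DWORD_PTR nodeMask, SIZE_T bytes)
{
    // 1) Pin the thread to the target node before touching the memory.
    SetThreadAffinityMask(GetCurrentThread(), nodeMask);

    // 2) Reserve and commit the address range.
    void* p = VirtualAlloc(NULL, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);

    // 3) Touch every page from the pinned thread; this first access is what
    //    places each page in the local memory bank.
    if (p != NULL)
        memset(p, 0, bytes);

    return p;
}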

Those are all the basic items a developer needs to start implementing NUMA-optimized code. More details are shown in the demo project, and MSDN describes the complete set of API calls that support threads and NUMA.

More articles about NUMA can be found here:

Frequently Asked Questions: NUMA, SMP and AMD's Direct Connect Architecture

AMD's New Designs on Software: NUMA

NUMA, HyperTransport, 64-Bit Windows, and You

A NUMA API for LINUX

Performance Guidelines for AMD Athlon™ and AMD Opteron™ ccNUMA Multiprocessor Systems

System configuration note: Windows XP Professional x64 Edition, Windows Server 2003, Windows Vista, and Longhorn Server all support NUMA. Any multi-socket AMD-based computer is a NUMA machine, but some BIOS versions offer a "node memory interleave" option that shuffles the address space across nodes and effectively disables NUMA; this BIOS option should be disabled to allow NUMA support.
