
Parallel Processing Guide

ANSYS, Inc. Release 2020 R1


Southpointe
2600 ANSYS Drive
Canonsburg, PA 15317
[email protected]
http://www.ansys.com
(T) 724-746-3304
(F) 724-514-9494

January 2020

ANSYS, Inc. and ANSYS Europe, Ltd. are UL registered ISO 9001: 2015 companies.
Copyright and Trademark Information

© 2020 ANSYS, Inc. Unauthorized use, distribution or duplication is prohibited.

ANSYS, ANSYS Workbench, AUTODYN, CFX, FLUENT and any and all ANSYS, Inc. brand, product, service and feature
names, logos and slogans are registered trademarks or trademarks of ANSYS, Inc. or its subsidiaries located in the
United States or other countries. ICEM CFD is a trademark used by ANSYS, Inc. under license. CFX is a trademark
of Sony Corporation in Japan. All other brand, product, service and feature names or trademarks are the property
of their respective owners. FLEXlm and FLEXnet are trademarks of Flexera Software LLC.

Disclaimer Notice

THIS ANSYS SOFTWARE PRODUCT AND PROGRAM DOCUMENTATION INCLUDE TRADE SECRETS AND ARE CONFID-
ENTIAL AND PROPRIETARY PRODUCTS OF ANSYS, INC., ITS SUBSIDIARIES, OR LICENSORS. The software products
and documentation are furnished by ANSYS, Inc., its subsidiaries, or affiliates under a software license agreement
that contains provisions concerning non-disclosure, copying, length and nature of use, compliance with exporting
laws, warranties, disclaimers, limitations of liability, and remedies, and other provisions. The software products
and documentation may be used, disclosed, transferred, or copied only in accordance with the terms and conditions
of that software license agreement.

ANSYS, Inc. and ANSYS Europe, Ltd. are UL registered ISO 9001: 2015 companies.

U.S. Government Rights

For U.S. Government users, except as specifically granted by the ANSYS, Inc. software license agreement, the use,
duplication, or disclosure by the United States Government is subject to restrictions stated in the ANSYS, Inc.
software license agreement and FAR 12.212 (for non-DOD licenses).

Third-Party Software

See the legal information in the product help files for the complete Legal Notice for ANSYS proprietary software
and third-party software. If you are unable to access the Legal Notice, contact ANSYS, Inc.

Published in the U.S.A.


Table of Contents
1. Overview of Parallel Processing .............................................................................................................. 1
1.1. Parallel Processing Terminology .......................................................................................... 1
1.1.1. Hardware Terminology ............................................................................................................. 1
1.1.2. Software Terminology .............................................................................................................. 2
1.2. HPC Licensing ................................................................................................................................... 3
2. Using Shared-Memory ANSYS ................................................................................................................ 5
2.1. Activating Parallel Processing in a Shared-Memory Architecture ........................................................ 5
2.1.1. System-Specific Considerations ............................................................................................... 6
2.2. Troubleshooting ............................................................................................................................... 6
3. GPU Accelerator Capability ..................................................................................................................... 9
3.1. Activating the GPU Accelerator Capability ....................................................................................... 10
3.2. Supported Analysis Types and Features ........................................................................................... 10
3.2.1. NVIDIA GPU Hardware ............................................................................................................ 11
3.2.1.1. Supported Analysis Types .............................................................................................. 11
3.2.1.2. Supported Features ....................................................................................................... 12
3.3. Troubleshooting ............................................................................................................................. 12
4. Using Distributed ANSYS ...................................................................................................................... 15
4.1. Configuring Distributed ANSYS ....................................................................................................... 17
4.1.1. Prerequisites for Running Distributed ANSYS .......................................................................... 17
4.1.1.1. MPI Software ................................................................................................................. 18
4.1.1.2. Installing the Software ................................................................................................... 18
4.1.2. Setting Up the Cluster Environment for Distributed ANSYS ...................................................... 19
4.1.2.1. Optional Setup Tasks ..................................................................................................... 20
4.1.2.2. Using the mpitest Program ........................................................................................ 21
4.1.2.3. Interconnect Configuration ............................................................................................ 22
4.2. Activating Distributed ANSYS ......................................................................................................... 22
4.2.1. Starting Distributed ANSYS via the Launcher .......................................................................... 23
4.2.2. Starting Distributed ANSYS via Command Line ....................................................................... 24
4.2.3. Starting Distributed ANSYS via the HPC Job Manager .............................................................. 25
4.2.4. Starting Distributed ANSYS in the Mechanical Application (via ANSYS Workbench) .................. 25
4.2.5. Using MPI Files ....................................................................................................................... 26
4.2.6. Directory Structure Across Machines ....................................................................................... 27
4.3. Supported Analysis Types and Features ........................................................................................... 27
4.3.1. Supported Analysis Types ....................................................................................................... 27
4.3.2. Supported Features ................................................................................................................ 28
4.4. Understanding the Working Principles and Behavior of Distributed ANSYS ....................................... 30
4.4.1. Differences in General Behavior ............................................................................................. 30
4.4.2. Differences in Solution Processing .......................................................................................... 33
4.4.3. Differences in Postprocessing ................................................................................................. 35
4.4.4. Restarts in Distributed ANSYS ................................................................................................. 35
4.4.4.1. Procedure 1 - Use the Same Number of Cores ................................................................. 35
4.4.4.2. Procedure 2 - Use a Different Number of Cores ............................................................... 37
4.4.4.3. Additional Considerations for the Restart ....................................................................... 38
4.5. Example Problems .......................................................................................................................... 38
4.5.1. Example: Running Distributed ANSYS on Linux ....................................................................... 39
4.5.2. Example: Running Distributed ANSYS on Windows .................................................................. 40
4.6. Troubleshooting ............................................................................................................................. 41
4.6.1. Setup and Launch Issues ........................................................................................................ 41
4.6.2. Stability Issues ....................................................................................................................... 43
4.6.3. Solution and Performance Issues ............................................................................................ 44

List of Tables
4.1. Parallel Capability in Shared-Memory and Distributed ANSYS ................................................................ 16
4.2. Platforms and MPI Software .................................................................................................................. 18
4.3. Required Files for Multiframe Restart - Procedure 1 ................................................................................ 36
4.4. Required Files for Multiframe Restart - Procedure 2 ................................................................................ 37

Chapter 1: Overview of Parallel Processing
Solving a large model with millions of DOFs or a medium-sized model with nonlinearities that needs
many iterations to reach convergence can require many CPU hours. To decrease simulation time, ANSYS,
Inc. offers different parallel processing options that increase the model-solving power of ANSYS products
by using multiple processors (also known as cores). The following three parallel processing capabilities
are available:

• Shared-memory parallel processing (shared-memory ANSYS) (p. 5)

• Distributed-memory parallel processing (Distributed ANSYS) (p. 15)

• GPU acceleration (a type of shared-memory parallel processing) (p. 9)

Multicore processors, and thus the ability to use parallel processing, are now widely available on all
computer systems, from laptops to high-end servers. The benefits of parallel processing are compelling
but are also among the most misunderstood. This chapter explains the two types of parallel processing
available in ANSYS and also discusses the use of GPUs (considered a form of shared-memory parallel
processing) and how they can further accelerate the time to solution.

Currently, the default scheme is to use two cores with distributed-memory parallelism. For many of the
computations involved in a simulation, the speedups obtained from parallel processing are nearly linear
as the number of cores is increased, making very effective use of parallel processing. However, the total
benefit (measured by elapsed time) is problem dependent and is influenced by many different factors.

No matter what form of parallel processing is used, the maximum benefit attained will always be limited
by the amount of work in the code that cannot be parallelized. If just 20 percent of the runtime is spent
in nonparallel code, the maximum theoretical speedup is only 5X, assuming the time spent in parallel
code is reduced to zero. However, parallel processing is still an essential component of any HPC system;
by reducing wall clock elapsed time, it provides significant value when performing simulations.
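
This limit is Amdahl's law. Writing s for the serial (nonparallel) fraction of the runtime and N for the
number of cores (notation introduced here only for illustration), the theoretical speedup is bounded by

\[ \text{speedup}(N) \;=\; \frac{1}{\,s + \dfrac{1-s}{N}\,} \;\le\; \frac{1}{s} \]

so for the example above, s = 0.2 gives a maximum speedup of 1/0.2 = 5, no matter how many cores are used.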

Distributed ANSYS, shared-memory ANSYS, and GPU acceleration can require HPC licenses. You can use
up to four CPU cores, or a combination of up to four CPU cores and GPUs, without using any HPC licenses.
Additional HPC licenses are needed to run with more than four. See HPC Licensing (p. 3) for more information.

1.1. Parallel Processing Terminology


It is important to fully understand the terms we use, both relating to our software and to the physical
hardware. The terms shared-memory ANSYS and Distributed ANSYS refer to our software offerings, which
run on shared-memory or distributed-memory hardware configurations. The term GPU accelerator cap-
ability refers to our software offering which allows the program to take advantage of certain GPU
(graphics processing unit) hardware to accelerate the speed of the solver computations.

1.1.1. Hardware Terminology


The following terms describe the hardware configurations used for parallel processing:


Shared-memory hardware: This term refers to a physical hardware configuration in which a single
shared-memory address space is accessible by multiple CPU cores; each CPU core “shares” the memory
with the other cores. A common example of a shared-memory system is a Windows desktop machine
or workstation with one or two multicore processors.

Distributed-memory hardware: This term refers to a physical hardware configuration in which multiple
machines are connected together on a network (i.e., a cluster). Each machine on the network (that is,
each compute node on the cluster) has its own memory address space. Communication between machines
is handled by interconnects (Gigabit Ethernet, Infiniband, etc.).

Virtually all clusters involve both shared-memory and distributed-memory hardware. Each compute
node on the cluster typically contains at least two or more CPU cores, which means there is a
shared-memory environment within a compute node. The distributed-memory environment requires
communication between the compute nodes involved in the cluster.

GPU hardware: A graphics processing unit (GPU) is a specialized microprocessor that off-loads and
accelerates graphics rendering from the microprocessor. Their highly parallel structure makes GPUs
more effective than general-purpose CPUs for a range of complex algorithms. In a personal computer,
a GPU on a dedicated video card is more powerful than a GPU that is integrated on the motherboard.

1.1.2. Software Terminology


The following terms describe our software offerings for parallel processing:

Shared-memory ANSYS: This term refers to running across multiple cores on a single machine (e.g., a
desktop workstation or a single compute node of a cluster). Shared-memory parallelism is invoked,
which allows each core involved to share data (or memory) as needed to perform the necessary parallel
computations. When run within a shared-memory architecture, most computations in the solution
phase and many pre- and postprocessing operations are performed in parallel. For more information,
see Using Shared-Memory ANSYS (p. 5).

Distributed ANSYS: This term refers to running across multiple cores on a single machine (e.g., a desktop
workstation or a single compute node of a cluster) or across multiple machines (e.g., a cluster).
Distributed-memory parallelism is invoked, and each core communicates data needed to perform the
necessary parallel computations through the use of MPI (Message Passing Interface) software. With
Distributed ANSYS, all computations in the solution phase are performed in parallel (including the
stiffness matrix generation, linear equation solving, and results calculations). Pre- and postprocessing
do not make use of the distributed-memory parallel processing; however, these steps can make use of
shared-memory parallelism. See Using Distributed ANSYS (p. 15) for more details.


GPU accelerator capability: This capability takes advantage of the highly parallel architecture of the
GPU hardware to accelerate the speed of solver computations and, therefore, reduce the time required
to complete a simulation. Some computations of certain equation solvers can be off-loaded from the
CPU(s) to the GPU, where they are often executed much faster. The CPU core(s) will continue to be used
for all other computations in and around the equation solvers. For more information, see GPU Accelerator
Capability (p. 9).

Shared-memory ANSYS can only be run on shared-memory hardware. However, Distributed ANSYS
can be run on either shared-memory or distributed-memory hardware. While both forms of
hardware can achieve a significant speedup with Distributed ANSYS, only running on distributed-
memory hardware allows you to take advantage of increased resources (for example, available memory
and disk space, as well as memory and I/O bandwidths) by using multiple machines. The GPU accel-
erator capability can be used with either shared-memory ANSYS or Distributed ANSYS.

1.2. HPC Licensing


ANSYS, Inc. offers the following high performance computing license options:

ANSYS HPC - These physics-neutral licenses can be used to run a single analysis across multiple
processors (cores).
ANSYS HPC Packs - These physics-neutral licenses share the same characteristics of the ANSYS HPC
licenses, but are combined into predefined packs to give you greater value and scalability.

For detailed information on these HPC license options, see HPC Licensing in the ANSYS Licensing Guide.

The HPC license options cannot be combined with each other in a single solution; for example, you
cannot use both ANSYS HPC and ANSYS HPC Packs in the same analysis solution.

The order in which HPC licenses are used is specified by your user license preferences setting. See
Specify Product Order in the ANSYS Licensing Guide for more information on setting user license product
order.

You can choose a particular HPC license by using the Preferred Parallel Feature command line option.
The format is ansys201 -ppf <license feature name>, where <license feature name>
is the name of the HPC license option that you want to use. This option forces Mechanical APDL to use
the specified license feature for the requested number of parallel cores or GPUs. If the license feature
is entered incorrectly or the license feature is not available, a license failure occurs.
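
For example, a sketch of such an invocation is shown below; the feature name anshpc is only an illustration
of an HPC license feature name (substitute the feature name configured at your site), and -np is the
core-count option described in the following chapters:

ansys201 -np 8 -ppf anshpc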

Both Distributed ANSYS and shared-memory ANSYS allow you to use four CPU cores without using any
HPC licenses. ANSYS HPC licenses add cores to this base functionality, while the ANSYS HPC Pack licenses
function independently of the four included cores.

In a similar way, you can use up to four CPU cores and GPUs combined without any HPC licensing (for
example, one CPU and three GPUs). The combined number of CPU cores and GPUs used cannot exceed
the task limit allowed by your specific license configuration.
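
As a command-line sketch of that example (one CPU core plus three GPUs), using the -acc and -na GPU
options described in GPU Accelerator Capability (p. 9):

ansys201 -np 1 -acc nvidia -na 3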

Chapter 2: Using Shared-Memory ANSYS
When running a simulation, the solution time is typically dominated by three main parts: the time spent
to create the element matrices and form the global matrices, the time to solve the linear system of
equations, and the time spent calculating derived quantities (such as stress and strain) and other reques-
ted results for each element.

Shared-memory ANSYS can run a solution over multiple cores on a single machine. When using shared-
memory parallel processing, you can reduce each of the three main parts of the overall solution time
by using multiple cores. However, this approach is often limited by the memory bandwidth; you typically
see very little reduction in solution time beyond four cores.

The main program functions that run in parallel on shared-memory hardware are:

• Solvers such as the Sparse, PCG, ICCG, Block Lanczos, PCG Lanczos, Supernode, and Subspace, running over
multiple processors while sharing the same memory address space. These solvers typically have limited scalability
when used with shared-memory parallelism. In general, very little reduction in time occurs when using more
than four cores.

• Forming element matrices and load vectors.

• Computing derived quantities and other requested results for each element.

• Pre- and postprocessing functions such as graphics, selecting, sorting, and other data and compute intensive
operations.

2.1. Activating Parallel Processing in a Shared-Memory Architecture


1. By default, shared-memory ANSYS uses two cores and does not require any HPC licenses. Additional HPC
licenses are required to run with more than four cores. Several HPC license options are available. See HPC
Licensing (p. 3) for more information.

2. Open the Mechanical APDL Product Launcher:

Windows:
Start >Programs >ANSYS 2020 R1 >Mechanical APDL Product Launcher

Linux:
launcher201

3. Select the correct environment and license.

4. Go to the High Performance Computing Setup tab. Select Use Shared-Memory Parallel (SMP). Specify
the number of cores to use.

5. Alternatively, you can specify the number of cores to use via the -np command line option:


ansys201 -smp -np N

where N represents the number of cores to use.

For large multiprocessor servers, ANSYS, Inc. recommends setting N to a value no higher than the
number of available cores minus one. For example, on an eight-core system, set N to 7. However,
on multiprocessor workstations, you may want to use all available cores to minimize the total
solution time. The program automatically limits the maximum number of cores used to be less
than or equal to the number of physical cores on the machine. This is done to avoid running the
program on virtual cores (e.g., by means of hyperthreading), which typically results in poor per-
core performance. For optimal performance, consider closing down all other applications before
launching ANSYS.

If you have more than one HPC license feature, you can use the -ppf command line option to
specify which HPC license to use for the parallel run. See HPC Licensing (p. 3) for more information.

6. If working from the launcher, click Run to launch ANSYS.

7. Set up and run your analysis as you normally would.
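
As a complete example of step 5, the following sketch launches a shared-memory batch run on an
eight-core workstation, reserving one core for the operating system as recommended above. The -b, -i,
and -o options select batch mode and the input and output files; model.inp and model.out are
placeholder names:

ansys201 -smp -np 7 -b -i model.inp -o model.out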

2.1.1. System-Specific Considerations


For shared-memory parallel processing, the number of cores that the program uses is limited to the
least of the following:

• The number of ANSYS HPC licenses available (plus the first four cores which do not require any licenses)

• The number of cores indicated via the -np command line argument

• The actual number of cores available

You can specify multiple settings for the number of cores to use during a session. However, ANSYS,
Inc. recommends that you issue the /CLEAR command before resetting the number of cores for
subsequent analyses.

2.2. Troubleshooting
This section describes problems which you may encounter while using shared-memory parallel processing
as well as methods for overcoming these problems. Some of these problems are specific to a particular
system, as noted.

Job fails with SIGTERM signal (Linux Only)


Occasionally, when running on Linux, a simulation may fail with the following message: “process killed
(SIGTERM)”. This typically occurs when computing the solution and means that the system has killed the
ANSYS process. The two most common causes are (1) ANSYS is using too much of the hardware resources
and the system has killed the ANSYS process, or (2) a user has manually killed the ANSYS job (for example,
via the kill -9 system command). Check the size of the job you are running in relation to the amount of
physical memory on the machine. Most often, decreasing the model size or finding a machine with more
RAM will result in a successful run.

Poor Speedup or No Speedup


As more cores are utilized, the runtimes are generally expected to decrease. The biggest relative gains are
typically achieved when using two cores compared to using a single core. When significant speedups are

not seen as additional cores are used, the reasons may involve both hardware and software issues. These
include, but are not limited to, the following situations.

Hardware
Oversubscribing hardware: In a multiuser environment, this could mean that more physical cores
are being used by ANSYS simulations than are available on the machine. It could also mean that hyper-
threading is activated. Hyperthreading typically involves enabling extra virtual cores, which can
sometimes allow software programs to more effectively use the full processing power of the CPU.
However, for compute-intensive programs such as ANSYS, using these virtual cores rarely provides a
significant reduction in runtime. Therefore, it is recommended you disable hyperthreading; if hyper-
threading is enabled, it is recommended you do not exceed the number of physical cores.

Lack of memory bandwidth: On some systems, using most or all of the available cores can
result in a lack of memory bandwidth. This lack of memory bandwidth can affect the overall
scalability of the ANSYS software.

Dynamic Processor Speeds: Many new CPUs have the ability to dynamically adjust the clock
speed at which they operate based on the current workloads. Typically, when only a single core
is being used the clock speed can be significantly higher than when all of the CPU cores are
being utilized. This can have a negative effect on scalability as the per-core computational per-
formance can be much higher when only a single core is active versus the case when all of the
CPU cores are active.

Software
Simulation includes non-supported features: The shared- and distributed-memory parallelisms
work to speed up certain compute-intensive operations in /PREP7, /SOLU and /POST1. However, not
all operations are parallelized. If a particular operation that is not parallelized dominates the simulation
time, then using additional cores will not help achieve a faster runtime.

Simulation has too few DOF (degrees of freedom): Some analyses (such as transient analyses)
may require long compute times, not because the number of DOF is large, but because a large
number of calculations are performed (i.e., a very large number of time steps). Generally, if the
number of DOF is relatively small, parallel processing will not significantly decrease the solution
time. Consequently, for small models with many time steps, parallel performance may be poor
because the model size is too small to fully utilize a large number of cores.

I/O cost dominates solution time: For some simulations, the amount of memory required to
obtain a solution is greater than the physical memory (i.e., RAM) available on the machine. In
these cases, either virtual memory (i.e., hard disk space) is used by the operating system to hold
the data that would otherwise be stored in memory, or the equation solver writes extra files to
the disk to store data. In both cases, the extra I/O done using the hard drive can significantly
affect performance, making the I/O performance the main bottleneck to achieving optimal
performance. In these cases, using additional cores will typically not result in a significant reduc-
tion in overall time to solution.

Different Results Relative to a Single Core


Shared-memory parallel processing occurs in various preprocessing, solution, and postprocessing operations.
Operational randomness and numerical round-off inherent to parallelism can cause slightly different results
between runs on the same machine using the same number of cores or different numbers of cores. This
difference is often negligible. However, in some cases the difference is appreciable. This sort of behavior
is most commonly seen on nonlinear static or transient analyses which are numerically unstable. The more
numerically unstable the model is, the more likely the convergence pattern or final results will differ as
the number of cores used in the simulation is changed.


With shared-memory parallelism, you can use the PSCONTROL command to control which operations
actually use parallel behavior. For example, you could use this command to show that the element
matrix generation running in parallel is causing a nonlinear job to converge to a slightly different
solution each time it runs (even on the same machine with no change to the input data). This can
help isolate parallel computations which are affecting the solution while maintaining as much other
parallelism as possible to continue to reduce the time to solution.
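
The following input fragment is a sketch of that diagnostic approach, assuming FORM is the PSCONTROL
option key that controls parallel element matrix generation (check the PSCONTROL command reference
for the exact option names in your release):

/SOLU
PSCONTROL,FORM,OFF    ! assumed option key: run element matrix generation serially
SOLVE                 ! all other shared-memory parallelism remains active
FINISH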

Chapter 3: GPU Accelerator Capability
In an effort to provide faster performance during solution, Mechanical APDL supports offloading key
solver computations onto graphics cards to accelerate those computations. Only high-end graphics
cards, those with the largest number of cores and the most memory, can be used to accelerate the solver
computations. For details on which GPU devices are supported and the corresponding driver versions,
see the GPU requirements outlined in the Windows Installation Guide and the Linux Installation Guide.

It is important to understand that a GPU does not replace the CPU core(s) on which a simulation typically
runs. One or more CPU cores must be used to run the Mechanical APDL program. The GPUs are used
in support of the CPU to process certain calculations. The CPU continues to handle most operations
and will automatically offload some of the time-intensive parallel operations performed by certain
equation solvers. These parallel solver operations can usually be performed much faster on the highly
parallel architecture of a GPU, thus accelerating these solvers and reducing the overall time to solution.

GPU acceleration can be used with both shared-memory parallel processing (shared-memory ANSYS)
and distributed-memory parallel processing (Distributed ANSYS). In shared-memory ANSYS, one or
multiple GPU accelerator devices can be utilized during solution. In Distributed ANSYS, one or multiple
GPU accelerator devices per machine or compute node can be utilized during solution.

As an example, when using Distributed ANSYS on a cluster involving eight compute nodes with each
compute node having two supported GPU accelerator devices, either a single GPU per node (a total of
eight GPU cards) or two GPUs per node (a total of sixteen GPU cards) can be used to accelerate the
solution. The GPU accelerator device usage must be consistent across all compute nodes. For example,
if running a simulation across all compute nodes, it is not possible to use one GPU for some compute
nodes and zero or two GPUs for the other compute nodes.

On machines containing multiple GPU accelerator devices, the program automatically selects the GPU
accelerator device (or devices) to be used for the simulation. The program cannot detect if a GPU device
is currently being used by other software, including another Mechanical APDL simulation. Therefore, in
a multiuser environment, users should be careful not to oversubscribe the GPU accelerator devices by
simultaneously launching multiple simulations that attempt to use the same GPU (or GPUs) to accelerate
the solution. For more information, see Oversubscribing GPU Hardware (p. 14) in the troubleshooting
discussion.

The GPU accelerator capability is only supported on the Windows 64-bit and Linux x64 platforms.

You can use up to four GPUs and CPUs combined without any HPC licensing (for example, one CPU
and three GPUs). To use more than four, you need one or more ANSYS HPC licenses or ANSYS HPC Pack
licenses. For more information see HPC Licensing in the ANSYS Licensing Guide.

The following GPU accelerator topics are available:


3.1. Activating the GPU Accelerator Capability
3.2. Supported Analysis Types and Features
3.3. Troubleshooting


3.1. Activating the GPU Accelerator Capability


Following is the general procedure to use the GPU accelerator capability:

1. Before activating the GPU accelerator capability, you must have at least one GPU card installed with
the proper driver level. You may also need some type of HPC license; see HPC Licensing (p. 3) for details.

2. Open the Mechanical APDL Product Launcher.

Windows:
Start >Programs >ANSYS 2020 R1 >Mechanical APDL Product Launcher

Linux:
launcher201

3. Select the correct environment and license.

4. Go to the High Performance Computing Setup tab, select a GPU device from the GPU Accelerator
drop-down menu, and specify the number of GPU accelerator devices.

5. Alternatively, you can activate the GPU accelerator capability via the -acc command line option:
ansys201 -acc nvidia -na N

The -na command line option followed by a number (N) indicates the number of GPU accelerator
devices to use per machine or compute node. If only the -acc option is specified, the program uses
a single GPU device per machine or compute node by default (that is, -na 1).

If you have more than one HPC license feature, you can use the -ppf command line option
to specify which HPC license to use for the parallel run. See HPC Licensing (p. 3) for more in-
formation.

6. If working from the launcher, click Run to launch Mechanical APDL.

7. Set up and run your analysis as you normally would.
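
As a complete example of step 5, the following sketch launches a distributed-memory run (the -dis option,
described in Using Distributed ANSYS (p. 15)) that uses four CPU cores and one NVIDIA GPU per machine:

ansys201 -dis -np 4 -acc nvidia -na 1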

With the GPU accelerator capability, the acceleration obtained by using the parallelism on the GPU
hardware occurs only during the solution operations. Operational randomness and numerical round-off
inherent to any parallel algorithm can cause slightly different results between runs on the same machine
when using or not using the GPU hardware to accelerate the simulation.

The ACCOPTION command can also be used to control activation of the GPU accelerator capability.

3.2. Supported Analysis Types and Features


Some analysis types and features are not supported by the GPU accelerator capability. Supported
functionality also depends on the specified GPU hardware. The following section gives general guidelines
on what is and is not supported:
3.2.1. NVIDIA GPU Hardware

These are not comprehensive lists, but represent major features and capabilities found in the Mechan-
ical APDL program.


3.2.1. NVIDIA GPU Hardware


This section lists analysis capabilities that are supported by the GPU accelerator capability when using
NVIDIA GPU cards.

3.2.1.1. Supported Analysis Types


The following analysis types are supported and will use the GPU to accelerate the solution.

• Static linear or nonlinear analyses using the sparse, PCG, or JCG solver.

• Buckling analyses using the Block Lanczos or subspace eigensolver.

• Modal analyses using the Block Lanczos, subspace, PCG Lanczos, QR damped, unsymmetric, or
damped eigensolver.

• Harmonic analyses using the full method and the sparse solver.

• Transient linear or nonlinear analyses using the full method and the sparse, PCG, or JCG solver.

• Substructuring analyses, generation pass only, including the generation pass of component mode
synthesis (CMS) analyses.

In situations where the analysis type is not supported by the GPU accelerator capability, the solution
will continue but GPU acceleration will not be used.

Performance Issue for Some Solver/Hardware Combinations

When using the PCG or JCG solver, or the PCG Lanczos eigensolver, any of the recommended NVIDIA
GPU devices can be expected to achieve good performance.

When using the sparse solver or eigensolvers based on the sparse solver (for example, Block Lanczos
or subspace), only NVIDIA GPU devices with significant double precision performance (FP64) are
recommended in order to achieve good performance. These include the following models:

• NVIDIA Tesla Series (any model except the following: K10/M4/M40/M6/M60/P4/P40)

• NVIDIA Quadro GV100

• NVIDIA Quadro GP100

• NVIDIA Quadro K6000

Shared-Memory Parallel Behavior


For the sparse solver (and eigensolvers based on the sparse solver), if one or more GPUs are reques-
ted, only a single GPU is used no matter how many are requested.

For the PCG and JCG solvers (and eigensolvers based on the PCG solver), all requested GPUs are
used.

Distributed-Memory Parallel Behavior


For the sparse solver (and eigensolvers based on the sparse solver), if the number of GPUs exceeds
the number of processes (the -na value is greater than the -np value on the command line), the
number of GPUs used equals the -np value. If the number of GPUs is less than the number of
processes (-na is less than -np), all requested GPUs are used.

For the PCG and JCG solvers (and eigensolvers based on the PCG solver), if the number of GPUs
exceeds the number of processes (-na is greater than -np), all requested GPUs are used. If the
number of GPUs is less than the number of processes (-na is less than -np), all requested GPUs
are used.

3.2.1.2. Supported Features


As the GPU accelerator capability currently only pertains to the equation solvers, virtually all features
and element types are supported when using this capability with the supported equation solvers
listed in Supported Analysis Types (p. 11). A few limitations exist and are listed below. In these
situations, the solution will continue but GPU acceleration will not be used (unless otherwise noted):

• Partial pivoting is activated when using the sparse solver. This most commonly occurs when using
current technology elements with mixed u-P formulation, Lagrange multiplier based MPC184 ele-
ments, Lagrange multiplier based contact elements (TARGE169 through CONTA178), or certain
circuit elements (CIRCU94, CIRCU124).

• The memory saving option is activated (MSAVE,ON) when using the PCG solver. In this particular
case, the MSAVE option is turned off and GPU acceleration is used.

• Unsymmetric matrices when using the PCG solver.

• A non-supported equation solver is used (for example, ICCG, etc.).

3.3. Troubleshooting
This section describes problems which you may encounter while using the GPU accelerator capability,
as well as methods for overcoming these problems. Some of these problems are specific to a particular
system, as noted.

NVIDIA GPUs support various compute modes (for example, Exclusive thread, Exclusive process). Only
the default compute mode is supported. Using other compute modes may cause the program to fail
to launch.

To list the GPU devices installed on the machine, set the ANSGPU_PRINTDEVICES environment variable
to a value of 1. The printed list may or may not include graphics cards used for display purposes, along
with any graphics cards used to accelerate your simulation.

No Devices
Be sure that a recommended GPU device is properly installed and configured. Check the driver level to be
sure it is current or newer than the driver version supported for your particular device. (See the GPU re-
quirements outlined in the Windows Installation Guide and the Linux Installation Guide.)


When using NVIDIA GPU devices, use of the CUDA_VISIBLE_DEVICES environment variable can block
some or all of the GPU devices from being visible to the program. Try renaming this environment
variable to see if the supported devices can be used.

Note:

On Windows, the use of Remote Desktop may disable the use of a GPU device. Launching
Mechanical APDL through the ANSYS Remote Solve Manager (RSM) when RSM is installed
as a service may also disable the use of a GPU. In these two scenarios, the GPU Acceler-
ator Capability cannot be used. Using the TCC (Tesla Compute Cluster) driver mode, if
applicable, can circumvent this restriction.

No Valid Devices
A GPU device was detected, but it is not a recommended GPU device. Be sure that a recommended GPU
device is properly installed and configured. Check the driver level to be sure it is current or newer than
the supported driver version for your particular device. (See the GPU requirements outlined in the Windows
Installation Guide and the Linux Installation Guide.) Consider using the ANSGPU_OVERRIDE environment
variable to override the check for valid GPU devices.

When using NVIDIA GPU devices, use of the CUDA_VISIBLE_DEVICES environment variable can block
some or all of the GPU devices from being visible to the program. Try renaming this environment
variable to see if the supported devices can be used.
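
As a sketch of that workaround on Linux (bash syntax), the variable can be saved and removed for the
current session before launching the program:

export CUDA_VISIBLE_DEVICES_SAVED="$CUDA_VISIBLE_DEVICES"
unset CUDA_VISIBLE_DEVICES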

Poor Acceleration or No Acceleration


Simulation includes non-supported features: A GPU device will only accelerate certain portions of a
simulation, mainly the solution time. If the bulk of the simulation time is spent outside of solution, the
GPU cannot have a significant effect on the overall analysis time. Even if the bulk of the simulation is spent
inside solution, you must be sure that a supported equation solver is utilized during solution and that no
unsupported options are used. Messages are printed in the output to alert users when a GPU is being used,
as well as when unsupported options/features are chosen which deactivate the GPU accelerator capability.

Simulation has too few DOF (degrees of freedom): Some analyses (such as transient analyses)
may require long compute times, not because the number of DOF is large, but because a large
number of calculations are performed (i.e., a very large number of time steps). Generally, if the
number of DOF is relatively small, GPU acceleration will not significantly decrease the solution time.
Consequently, for small models with many time steps, GPU acceleration may be poor because the
model size is too small to fully utilize a GPU.

Simulation does not fully utilize the GPU: Only simulations that spend a lot of time performing
calculations that are supported on a GPU can expect to see significant speedups when a GPU is
used. Only certain computations are supported for GPU acceleration. Therefore, users should check
to ensure that a high percentage of the solution time was spent performing computations that
could possibly be accelerated on a GPU. This can be done by reviewing the equation solver statistics
files as described below. See Measuring Performance in the Performance Guide for more details on
the equation solver statistics files.

• PCG solver file: The .PCS file contains statistics for the PCG iterative solver. You should first
check to make sure that the GPU was utilized by the solver. This can be done by looking at the
line which begins with: “Number of cores used”. The string “GPU acceleration enabled” will be
added to this line if the GPU hardware was used by the solver. If this string is missing, the GPU
was not used for that call to the solver. Next, you should study the elapsed times for both the
“Preconditioner Factoring” and “Multiply With A22” computations. GPU hardware is only used to
accelerate these two sets of computations. The wall clock (or elapsed) times for these computations
are the areas of interest when determining how much GPU acceleration is achieved.

• Sparse solver files: The .DSP file contains statistics for the sparse direct solver. You should first
check to make sure that the GPU was utilized by the solver. This can be done by looking for the
following line: “GPU acceleration activated”. This line will be printed if the GPU hardware was
used. If this line is missing, the GPU was not used for that call to the solver. Next, you should
check the percentage of factorization computations (flops) which were accelerated on a GPU.
This is shown by the line: “percentage of GPU accelerated flops”. Also, you should look at the
time to perform the matrix factorization, shown by the line: “time (cpu & wall) for numeric factor”.
GPU hardware is only used to accelerate the matrix factor computations. These lines provide
some indication of how much GPU acceleration is achieved.

• Eigensolver files: The Block Lanczos and Subspace eigensolvers support the use of GPU devices;
however, no statistics files are written by these eigensolvers. The .PCS file is written for the PCG
Lanczos eigensolver and can be used as described above for the PCG iterative solver.

Using multiple GPU devices: When using the sparse solver in a shared-memory parallel solution,
it is expected that running a simulation with multiple GPU devices will not improve performance
compared to running with a single GPU device. In a shared-memory parallel solution, the sparse
solver can only make use of one GPU device.

Oversubscribing GPU hardware: The program automatically determines which GPU devices to
use. In a multiuser environment, this could mean that one or more of the same GPUs are picked
when multiple simulations are run simultaneously, thus oversubscribing the hardware.

• If only a single GPU accelerator device exists in the machine, then only a single user should attempt
to make use of it, much in the same way users should avoid oversubscribing their CPU cores.

• If multiple GPU accelerator devices exist in the machine, you can set the ANSGPU_DEVICE envir-
onment variable, in conjunction with the ANSGPU_PRINTDEVICES environment variable mentioned
above, to specify which particular GPU accelerator devices to use during the solution.

For example, consider a scenario where ANSGPU_PRINTDEVICES shows that four GPU
devices are available with device ID values of 1, 3, 5, and 7 respectively, and only the second
and third devices are supported for GPU acceleration. To select only the second supported
GPU device, set ANSGPU_DEVICE = 5. To select the first and second supported GPU devices,
set ANSGPU_DEVICE = 3:5.
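
Continuing that scenario as a Linux (bash) sketch, using the hypothetical device IDs from the example above:

export ANSGPU_PRINTDEVICES=1   # print the detected GPU devices and their IDs
export ANSGPU_DEVICE=3:5       # use the devices with IDs 3 and 5 (the first and second supported GPUs)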

Solver/hardware combination: When using NVIDIA GPU devices, some solvers may not achieve
good performance on certain devices. For more information, see Performance Issue for Some
Solver/Hardware Combinations (p. 11).

Chapter 4: Using Distributed ANSYS
When running a simulation, the solution time is typically dominated by three main parts: the time spent
to create the element matrices and form the global matrices or global systems of equations, the time
to solve the linear system of equations, and the time spent calculating derived quantities (such as stress
and strain) and other requested results for each element.

The distributed-memory parallelism offered via Distributed ANSYS allows the entire solution phase to
run in parallel, including the stiffness matrix generation, linear equation solving, and results calculations.
As a result, a simulation using distributed-memory parallel processing usually achieves much faster
solution times than a similar run performed using shared-memory parallel processing (p. 5), particularly
at higher core counts.

Distributed ANSYS can run a solution over multiple cores on a single machine or on multiple machines
(that is, a cluster). It automatically decomposes the model into smaller domains, transfers the domains
to each core, solves each domain simultaneously, and creates a complete solution to the model. The
memory and disk space required to complete the solution can also be distributed over multiple machines.
By utilizing all of the resources of a cluster (computing power, RAM, memory and I/O bandwidth), dis-
tributed-memory parallel processing can be used to solve very large problems much more efficiently
compared to the same simulation run on a single machine.

Distributed ANSYS Behavior


Distributed ANSYS works by launching multiple ANSYS processes on either a single machine or on
multiple machines (as specified by one of the following command line options: -np, -machines, or
-mpifile). The machine that the distributed run is launched from is referred to as the master or host
machine (or in some cases, primary compute node), and the other machines are referred to as the slave
machines (or compute nodes). The first process launched on the master machine is referred to as the
master or host process; all other processes are referred to as the slave processes.
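
For example (a sketch; the machine names and core counts are placeholders), a distributed run on a
single machine can be launched with:

ansys201 -dis -np 8

and a run across two machines using four cores on each can be launched with the -machines option:

ansys201 -dis -machines mach1:4:mach2:4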

Each Distributed ANSYS process is essentially a running process of shared-memory ANSYS. These processes
are launched through the specified MPI software layer. The MPI software allows each Distributed ANSYS
process to communicate, or exchange data, with the other processes involved in the distributed simu-
lation.

Distributed ANSYS does not currently support all of the analysis types, elements, solution options, etc.
that are available with shared-memory ANSYS (see Supported Features (p. 28)). In some cases, Distributed
ANSYS stops the analysis to avoid performing an unsupported action. If this occurs, you must launch
shared-memory ANSYS to perform the simulation. In other cases, Distributed ANSYS will automatically
disable the distributed-memory parallel processing capability and perform the operation using shared-
memory parallelism. This disabling of the distributed-memory parallel processing can happen at various
levels in the program.

The master process handles the inputting of commands as well as all of the pre- and postprocessing
actions. Only certain commands (for example, the SOLVE command and supporting commands such
as /SOLU, FINISH, /EOF, /EXIT, and so on) are communicated to the slave processes for execution.
Therefore, outside of the SOLUTION processor (/SOLU), Distributed ANSYS behaves very similar to

Release 2020 R1 - © ANSYS, Inc. All rights reserved. - Contains proprietary and confidential information
of ANSYS, Inc. and its subsidiaries and affiliates. 15
Using Distributed ANSYS

shared-memory ANSYS. The master process works on the entire model during these pre- and postpro-
cessing steps and may use shared-memory parallelism to improve performance of these operations.
During this time, the slave processes wait to receive new commands from the master process.

Once the SOLVE command is issued, it is communicated to the slave processes and all Distributed
ANSYS processes become active. At this time, the program makes a decision as to which mode to use
when computing the solution. In some cases, the solution will proceed using only a distributed-memory
parallel (DMP) mode. In other cases, similar to pre- and postprocessing, the solution will proceed using
only a shared-memory parallel (SMP) mode. In a few cases, a mixed mode may be implemented which
tries to use as much distributed-memory parallelism as possible for maximum performance. These three
modes are described further below.

Pure DMP mode: The simulation is fully supported by Distributed ANSYS, and distributed-memory
parallelism is used throughout the solution. This mode typically provides optimal performance in Dis-
tributed ANSYS.

Mixed mode: The simulation involves a particular set of computations that is not supported by Dis-
tributed ANSYS. Examples include certain equation solvers and remeshing due to mesh nonlinear ad-
aptivity. In these cases, distributed-memory parallelism is used throughout the solution, except for the
unsupported set of computations. When that step is reached, the slave processes in Distributed ANSYS
simply wait while the master process uses shared-memory parallelism to perform the computations.
After the computations are finished, the slave processes continue to compute again until the entire
solution is completed.

Pure SMP mode: The simulation involves an analysis type or feature that is not supported by Distributed
ANSYS. In this case, distributed-memory parallelism is disabled at the onset of the solution, and shared-
memory parallelism is used instead. The slave processes in Distributed ANSYS are not involved at all in
the solution but simply wait while the master process uses shared-memory parallelism to compute the
entire solution.

When using shared-memory parallelism inside of Distributed ANSYS (in mixed mode or SMP mode, in-
cluding all pre- and postprocessing operations), the master process will not use more cores on the
master machine than the total cores you specify to be used for the Distributed ANSYS solution. This is
done to avoid exceeding the requested CPU resources or the requested number of licenses.

The following table shows which steps, including specific equation solvers, can be run in parallel using
shared-memory ANSYS and Distributed ANSYS.

Table 4.1: Parallel Capability in Shared-Memory and Distributed ANSYS

Solvers/Feature                               Shared-Memory ANSYS    Distributed ANSYS
Sparse                                        Y                      Y
PCG                                           Y                      Y
ICCG                                          Y                      Y [1]
JCG                                           Y                      Y [1] [2]
QMR                                           Y                      Y [1]
Block Lanczos eigensolver                     Y                      Y
PCG Lanczos eigensolver                       Y                      Y
Supernode eigensolver                         Y                      Y [1]
Subspace eigensolver                          Y                      Y
Unsymmetric eigensolver                       Y                      Y
Damped eigensolver                            Y                      Y
QRDAMP eigensolver                            Y                      Y
Element formulation, results calculation      Y                      Y
Graphics and other pre- and postprocessing    Y                      Y [1]

1. This solver/operation only runs in mixed mode.

2. For static analyses and transient analyses using the full method (TRNOPT,FULL), the JCG equation solver
runs in pure DMP mode only when the matrix is symmetric. Otherwise, it runs in SMP mode.

The maximum number of cores allowed in a Distributed ANSYS analysis is currently set at 8192. Therefore,
you can run Distributed ANSYS using anywhere from 2 to 8192 cores (assuming the appropriate HPC
licenses are available) for each individual job. Performance results vary widely for every model when
using any form of parallel processing. For every model, there is a point where using more cores does
not significantly reduce the overall solution time. Therefore, it is expected that most models run in
Distributed ANSYS cannot efficiently make use of hundreds or thousands of cores.

Files generated by Distributed ANSYS are named Jobnamen.ext, where n is the process number. (See
Differences in General Behavior (p. 30) for more information.) The master process is always numbered
0, and the slave processes are 1, 2, etc. When the solution is complete and you issue the FINISH command
in the SOLUTION processor, Distributed ANSYS combines all Jobnamen.RST files into a single
Jobname.RST file, located on the master machine. Other files, such as .MODE, .ESAV, .EMAT, etc., may
be combined as well upon finishing a distributed solution. (See Differences in Postprocessing (p. 35)
for more information.)

The remaining sections explain how to configure your environment to run Distributed ANSYS, how to
run a Distributed ANSYS analysis, and what features and analysis types are supported in Distributed
ANSYS. You should read these sections carefully and fully understand the process before attempting
to run a distributed analysis. The proper configuration of your environment and the installation and
configuration of the appropriate MPI software are critical to successfully running a distributed analysis.

4.1. Configuring Distributed ANSYS


To run Distributed ANSYS on a single machine, no additional setup is required.

To run an analysis with Distributed ANSYS on a cluster, some configuration is required as described in
the following sections:
4.1.1. Prerequisites for Running Distributed ANSYS
4.1.2. Setting Up the Cluster Environment for Distributed ANSYS

4.1.1. Prerequisites for Running Distributed ANSYS


Whether you are running on a single machine or multiple machines, the following applies:


• By default, Distributed ANSYS uses two cores and does not require any HPC licenses. Additional licenses
will be needed to run a distributed solution with more than four cores. Several HPC license options are
available. For more information, see HPC Licensing (p. 3) in the Parallel Processing Guide (p. 1).

If you are running on a single machine, there are no additional requirements for running a distributed
solution.

If you are running across multiple machines (for example, a cluster), your system must meet these
additional requirements to run a distributed solution.

• Homogeneous network: All machines in the cluster must be of the same type, OS level, chip set, and
interconnect.

• You must be able to remotely log in to all machines, and all machines in the cluster must have identical
directory structures (including the ANSYS 2020 R1 installation, MPI installation, and working directories).
Do not change or rename directories after you've launched ANSYS. For more information, see Directory
Structure Across Machines (p. 27) in the Parallel Processing Guide (p. 1).

• All machines in the cluster must have ANSYS 2020 R1 installed, or must have an NFS mount to the ANSYS
2020 R1 installation. If not installed on a shared file system, ANSYS 2020 R1 must be installed in the same
directory path on all systems.

• All machines must have the same version of MPI software installed and running. The table below shows
the MPI software and version level supported for each platform.

4.1.1.1. MPI Software


The MPI software supported by Distributed ANSYS depends on the platform (see the table below).

The files needed to run Distributed ANSYS using Intel MPI or IBM Platform MPI are included on the
installation media and are installed automatically when you install ANSYS 2020 R1. Therefore, when
running on a single machine (for example, a laptop, a workstation, or a single compute node of a
cluster) on Windows or Linux, or when running on a Linux cluster, no additional software is needed.
However, when running on multiple Windows machines (a cluster), you must install the MPI software
separately (see Installing the Software later in this section).

Table 4.2: Platforms and MPI Software

Platform                        MPI Software

Linux x86_64                    Intel MPI 2018.3.222
                                IBM MPI 9.1.4.3
Windows 10 x64                  Intel MPI 2018.3.210
                                IBM MPI 9.1.4.5
Windows HPC Server 2016 x64     Microsoft HPC Pack (MS MPI v7.1)

4.1.1.2. Installing the Software


Install ANSYS 2020 R1 following the instructions in the ANSYS, Inc. Installation Guide for your platform.
Be sure to complete the installation, including all required post-installation procedures.

To run Distributed ANSYS on a cluster, you must:


• Install ANSYS 2020 R1 on all machines in the cluster, in the exact same location on each machine.

• For Windows, you can use shared drives and symbolic links. Install ANSYS 2020 R1 on one Windows
machine (for example, C:\Program Files\ANSYS Inc\V201) and then share that installation folder. On the
other machines in the cluster, create a symbolic link (at C:\Program Files\ANSYS Inc\V201) that points
to the UNC path for the shared folder. On Windows systems, you must use the Universal Naming
Convention (UNC) for all file and path names for Distributed ANSYS to work correctly.

• For Linux, you can use exported NFS file systems. Install ANSYS 2020 R1 on one Linux machine
(for example, at /ansys_inc/v201), and then export this directory. On the other machines in the cluster,
create an NFS mount from the first machine to the same local directory (/ansys_inc/v201).
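
As an illustration only, the symbolic link and NFS mount described above might be created as follows
(the machine names headnode and installhost are placeholders; adjust share names and paths to match
your installation):

On each Windows compute node (from an elevated command prompt):
mklink /D "C:\Program Files\ANSYS Inc\V201" "\\headnode\ANSYS Inc\V201"

On each Linux compute node (as root):
mkdir -p /ansys_inc/v201
mount -t nfs installhost:/ansys_inc/v201 /ansys_inc/v201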

Installing MPI software on Windows


You can install Intel MPI or IBM MPI from the installation launcher by choosing Install MPI for
ANSYS, Inc. Parallel Processing. For installation instructions, see the following sections in the
ANSYS, Inc. Windows Installation Guide:

Intel-MPI 2018.3.210 Installation Instructions


IBM MPI 9.1.4.5 Installation Instructions

Microsoft HPC Pack (Windows HPC Server 2016)


You must complete certain post-installation steps before running Distributed ANSYS on a Microsoft
HPC Server 2016 system. The post-installation instructions provided below assume that Microsoft
HPC Server 2016 and Microsoft HPC Pack (which includes MS MPI) are already installed on your
system. The post-installation instructions can be found in the following README files:

Program Files\ANSYS Inc\V201\commonfiles\MPI\WindowsHPC\README.mht

or

Program Files\ANSYS Inc\V201\commonfiles\MPI\WindowsHPC\README.docx

Microsoft HPC Pack examples are also located in Program Files\ANSYS Inc\V201\commonfiles\MPI\WindowsHPC.
Jobs are submitted to the Microsoft HPC Job Manager either from the command line or from the Job
Manager GUI.

To submit a job via the GUI, go to Start> All Programs> Microsoft HPC Pack> HPC Job Manager.
Then click on Create New Job from Description File.

4.1.2. Setting Up the Cluster Environment for Distributed ANSYS


After you've ensured that your cluster meets the prerequisites and you have ANSYS 2020 R1 and the
correct version of MPI installed, you need to configure your distributed environment using the following
procedure.

1. Obtain the machine name for each machine on the cluster.


Windows 10 and Windows Server 2016:


From the Start menu, pick Settings > System > About. The full computer name is listed under PC
Name. Note the name of each machine (not including the domain).

Linux:
Type hostname on each machine in the cluster. Note the name of each machine.

2. Linux only: First determine if the cluster uses the secure shell (ssh) or remote shell (rsh) protocol.

• For ssh: Use the ssh-keygen command to generate a pair of authentication keys. Do not enter a
passphrase. Then append the new public key to the list of authorized keys on each compute node
in the cluster that you wish to use.

• For rsh: Create a .rhosts file in the home directory. Add the name of each compute node you wish
to use on a separate line in the .rhosts file. Change the permissions of the .rhosts file by issuing:
chmod 600 .rhosts. Copy this .rhosts file to the home directory on each compute node in
the cluster you wish to use.

Verify communication between compute nodes on the cluster via ssh or rsh. You should not be
prompted for a password. If you are, correct this before continuing. For more information on using
ssh/rsh without passwords, search online for "Passwordless SSH" or "Passwordless RSH", or see the man
pages for ssh or rsh.
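
For example, a minimal passwordless-ssh setup run from the master node might look like the following
sketch (it assumes OpenSSH; user and node1 are placeholders, and the ssh-copy-id step is repeated for
each compute node you plan to use):

ssh-keygen -t rsa          # press Enter at the passphrase prompts to leave it empty
ssh-copy-id user@node1     # appends your public key to ~/.ssh/authorized_keys on node1
ssh node1 hostname         # should print the node name without prompting for a password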

3. Windows only: Verify that all required environment variables are properly set. If you followed the post-
installation instructions described above for Microsoft HPC Pack (Windows HPC Server), these variables
should be set automatically.

On the head node, where ANSYS 2020 R1 is installed, check these variables:

ANSYS201_DIR=C:\Program Files\ANSYS Inc\v201\ansys

ANSYSLIC_DIR=C:\Program Files\ANSYS Inc\Shared Files\Licensing

where C:\Program Files\ANSYS Inc is the location of the product install and C:\Program
Files\ANSYS Inc\Shared Files\Licensing is the location of the licensing install. If
your installation locations are different than these, specify those paths instead.

On Windows systems, you must use the Universal Naming Convention (UNC) for all ANSYS, Inc.
environment variables on the compute nodes for Distributed ANSYS to work correctly.

On the compute nodes, check these variables:

ANSYS201_DIR=\\head_node_machine_name\ANSYS Inc\v201\ansys

ANSYSLIC_DIR=\\head_node_machine_name\ANSYS Inc\Shared Files\Licensing

4. Windows only: Share out the ANSYS Inc directory on the head node with full permissions so that
the compute nodes can access it.

4.1.2.1. Optional Setup Tasks


The tasks explained in this section are optional. They are not required to get Distributed ANSYS to
run correctly, but they may be useful for achieving the best usability and efficiency, depending on
your system configuration.


On Linux systems, you can also set the following environment variables:

• ANSYS_NETWORK_START - This is the time, in seconds, to wait before timing out on the start-up
of the client (default is 15 seconds).

• ANSYS_NETWORK_COMM - This is the time to wait, in seconds, before timing out while communicating
with the client machine (default is 5 seconds).

• ANS_SEE_RUN_COMMAND - Set this environment variable to 1 to display the actual command
issued by ANSYS.
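
For example, in a bash shell these variables might be set before launching the program (the values
shown are illustrative only, not recommendations):

export ANSYS_NETWORK_START=30
export ANSYS_NETWORK_COMM=10
export ANS_SEE_RUN_COMMAND=1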

On Windows systems, you can set the following environment variables to display the actual command
issued by ANSYS:

• ANS_SEE_RUN = TRUE

• ANS_CMD_NODIAG = TRUE

4.1.2.2. Using the mpitest Program


The mpitest program performs a simple communication test to verify that the MPI software is
set up correctly. The mpitest program should start without errors. If it does not, check your paths
and permissions; correct any errors and rerun.

When running the mpitest program, you must use an even number of processes. We recommend
you start with the simplest test between two processes running on a single node. This can be done
via the procedures outlined here for each platform and MPI type.

The command line arguments -np, -machines, and -mpifile work with the mpitest program
in the same manner as they do with Distributed ANSYS (see Starting Distributed ANSYS via Command
Line (p. 24)).

On Linux:

For Intel MPI (default), issue the following command:


mpitest201 -np 2

which is equivalent to:


mpitest201 -machines machine1:2

For IBM MPI, issue the following command:


mpitest201 -mpi ibmmpi -np 2

which is equivalent to:


mpitest201 -mpi ibmmpi -machines machine1:2

On Windows:

For Intel MPI (default), issue the following command:


ansys201 -np 2 -mpitest


which is equivalent to:


ansys201 -machines machine1:2 -mpitest

For IBM MPI, issue the following command:


ansys201 -mpi ibmmpi -np 2 -mpitest

which is equivalent to:


ansys201 -mpi ibmmpi -machines machine1:2 -mpitest

4.1.2.3. Interconnect Configuration


Using a slow interconnect reduces the performance you experience in a distributed parallel simulation.
For optimal performance, we recommend an interconnect with a high communication bandwidth
(2000 megabytes/second or higher) and a low communication latency (5 microseconds or lower).
This is due to the significant amount of data that must be transferred between processes during a
distributed parallel simulation.

Distributed ANSYS supports the following interconnects. Not all interconnects are available on all
platforms; see the Platform Support section of the ANSYS Website for a current list of supported
interconnects. Other interconnects may work but have not been tested.

• InfiniBand (recommended)

• Omni-Path (recommended)

• GigE

On Windows x64 systems, use the Network Wizard in the Compute Cluster Administrator to configure
your interconnects. See the Compute Cluster Pack documentation for specific details on setting up
the interconnects. You may need to ensure that Windows Firewall is disabled for Distributed ANSYS
to work correctly.

4.2. Activating Distributed ANSYS


After you've completed the configuration steps, you can use several methods to start Distributed ANSYS.
We recommend using the Mechanical APDL Product Launcher to ensure the correct settings. All
methods are explained here.

• Use the launcher (p. 23)

• Use the command line (p. 24)

• Use the HPC Job Manager on Windows x64 systems to run across multiple machines (p. 25)

• Use Remote Solve in ANSYS Workbench. (p. 25)

Notes on Running Distributed ANSYS:

You can use an NFS mount to the ANSYS 2020 R1 installation on Linux, or shared folders on Windows.
However, we do not recommend using either NFS mounts or shared folders for the working directories,
as doing so can result in significant declines in performance.


Only the master process reads the config.ans file. Distributed ANSYS ignores the /CONFIG,NOELDB
and /CONFIG,FSPLIT commands.

The program limits the number of processes used to be less than or equal to the number of physical
cores on the machine. This is done to avoid running the program on virtual cores (for example, by
means of hyperthreading), which typically results in poor per-core performance. For optimal performance,
consider closing down all other applications before launching Mechanical APDL.
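
For example, on a Linux machine you can check the number of physical cores before choosing -np
(a sketch; the exact output labels vary by distribution):

lscpu | grep -iE 'socket|core'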

4.2.1. Starting Distributed ANSYS via the Launcher


Use the following procedure to start Distributed ANSYS via the launcher.

1. Open the Mechanical APDL Product Launcher:

Windows 10 and Windows Server 2016:


Start > All Apps > ANSYS 2020 R1 > Mechanical APDL Product Launcher 2020 R1

Linux:
launcher201

2. Select the correct environment and license.

3. Go to the High Performance Computing Setup tab. Select Use Distributed Computing (MPP).

Note:

On Linux systems, you cannot use the Launcher to start the program in interactive
(GUI) mode with distributed computing; this combination is blocked.

Specify the MPI type to be used for this distributed run. MPI types include:

• Intel MPI

• IBM MPI

• MS MPI (Windows only)

See the MPI table in the beginning of this chapter for the specific MPI version for each platform.
If you choose MS MPI, you cannot specify multiple hosts or an MPI file.

Choose whether you want to run on a local machine, specify multiple hosts, or specify an existing
MPI file (such as an Intel MPI configuration file or an IBM MPI appfile):

• If local machine, specify the number of cores you want to use on that machine.

• If multiple hosts, use the New Host button to add machines to the Selected Hosts list.

• If specifying an MPI file, type in the full path to the file, or browse to the file. If typing in the path,
you must use the absolute path.

Additional Options for Linux Systems: On Linux systems, you can choose to use the remote
shell (rsh) protocol instead of the secure shell (ssh) protocol; ssh is the default.


Also in the launcher, you can select the Use launcher-specified working directory on all nodes
option on the High Performance Computing Setup tab. This option uses the working directory
as specified on the File Management tab as the directory structure on the master and all nodes.
If you select this option, all machines will require the identical directory structure matching the
working directory specified on the launcher.

4. Click Run to launch ANSYS.

4.2.2. Starting Distributed ANSYS via Command Line


You can also start Distributed ANSYS via the command line using the following procedures.

Running on a Local Host: If you are running Distributed ANSYS locally (that is, running across
multiple cores on a single machine), you need to specify the number of cores you want to use:
ansys201 -dis -np n

If you are using Intel MPI, you do not need to specify the MPI software via the command line option.
To specify IBM MPI, use the -mpi ibmmpi command line option as shown below:
ansys201 -dis -mpi ibmmpi -np n

For example, if you run a job in batch mode on a local host using four cores with an input file named
input1 and an output file named output1, the launch commands for Linux and Windows would
be as shown below.

On Linux:
ansys201 -dis -np 4 -b < input1 > output1 (for default Intel MPI)

or
ansys201 -dis -mpi ibmmpi -np 4 -b < input1 > output1 (for IBM MPI)

On Windows:
ansys201 -dis -np 4 -b -i input1 -o output1 (for default Intel MPI)

or
ansys201 -dis -mpi ibmmpi -np 4 -b -i input1 -o output1 (for IBM MPI)

Running on Multiple Hosts: If you are running Distributed ANSYS across multiple hosts, you need
to specify the number of cores you want to use on each machine:
ansys201 -dis -machines machine1:np:machine2:np:machine3:np

To specify IBM MPI (instead of the default Intel MPI), use the -mpi command line option as shown
below:
ansys201 -dis -mpi ibmmpi -machines machine1:np:machine2:np:machine3:np

On Linux, you may also need to specify the shell protocol used by the MPI software. Distributed ANSYS
uses the secure shell protocol by default, but in some cluster environments it may be necessary to
force the use of the remote shell protocol. This can be done via the -usersh command line option.


Consider the following examples which assume a batch mode run using two machines (using four
cores on one machine and using two cores on the other machine), with an input file named input1
and an output file named output1. The launch commands for Linux and Windows would be as
shown below.

On Linux:
ansys201 -dis -b -machines machine1:4:machine2:2 < input1 > output1 (for default Intel MPI)

or
ansys201 -dis -mpi ibmmpi -b -machines machine1:4:machine2:2 < input1 > output1 (for IBM MPI)

or
ansys201 -dis -b -machines machine1:4:machine2:2 -usersh < input1 > output1 (for remote shell protocol between machines)

On Windows:
ansys201 -dis -b -machines machine1:4:machine2:2 -i input1 -o output1 (for default Intel MPI)

or
ansys201 -dis -mpi ibmmpi -b -machines machine1:4:machine2:2 -i input1 -o output1 (for IBM MPI)

The first machine specified with -machines in a Distributed ANSYS run must be the host machine
and must contain any files necessary for the initiation of your job (for example, the input file,
config.ans file, database file, etc.).

If both the -np and -machines options are used on a single command line, the -np option will be ignored.

Specifying a Preferred Parallel Feature License: If you have more than one HPC license feature,
you can use the -ppf command line option to specify which HPC license to use for the parallel run.
See HPC Licensing (p. 3) for more information.

4.2.3. Starting Distributed ANSYS via the HPC Job Manager


If you are running on Windows x64 systems using Microsoft HPC Pack (MS MPI), you need to use the
HPC Job Manager to start Distributed ANSYS. For more information, refer to the following README
files:

Program Files\ANSYS Inc\V201\commonfiles\MPI\WindowsHPC\README.mht

or

Program Files\ANSYS Inc\V201\commonfiles\MPI\WindowsHPC\README.docx

4.2.4. Starting Distributed ANSYS in the Mechanical Application (via ANSYS Workbench)

If you are running ANSYS Workbench, you can start a Distributed ANSYS job in the Mechanical
application. Go to Tools > Solve Process Settings; select the remote solve process you want to use and
click the Advanced button. To enable Distributed ANSYS, ensure that Distribute Solution (if possible)
is selected (this is the default). If necessary, enter any additional command line arguments to be
submitted; the options are described in Starting Distributed ANSYS via Command Line (p. 24). If
Distribute Solution (if possible) is selected, you do not need to specify the -dis flag on the
command line.

If you are running a remote solution on multiple machines, use the -machines option to specify
the machines and cores on which to run the job. If you are running a remote solution on one machine
with multiple cores, specify the number of cores in the Max Number of Utilized Cores field; no
command line arguments are needed.

For more information on running a distributed solution in ANSYS Workbench, see Using Solve Process
Settings.

4.2.5. Using MPI Files


You can specify an existing MPI file (such as an Intel MPI configuration file) on the command line
rather than typing out multiple hosts or a complicated command line:
ansys201 -dis -mpifile file_name

For an IBM MPI appfile, include the -mpi command line option:
ansys201 -dis -mpi ibmmpi -mpifile file_name

The format of the appfile is specific to the MPI library being used. Refer to the documentation provided
by the MPI vendor for the proper syntax.

If the file is not in the current working directory, you will need to include the full path to the file. The
file must reside on the local machine.

You cannot use the -mpifile option in conjunction with the -np (local host) or -machines
(multiple hosts) options.

For details on working with the MPI file, see the documentation provided by the MPI vendor.

Example MPI files for both Intel MPI and IBM MPI are shown below. For use on your system, modify
the hostnames (mach1, mach2), input filename (inputfile), and output filename (outputfile)
accordingly. Additional command-line arguments, if needed, can be added at the end of each line.

Intel MPI Configuration File Examples


Intel MPI uses a configuration file to define the machine(s) that will be used for the simulation.
Typical Intel MPI configuration files are shown below.

On Linux:
-host mach1 -np 2 /ansys_inc/v201/ansys/bin/ansysdis201 -dis -b -i inputfile -o outputfile
-host mach2 -np 2 /ansys_inc/v201/ansys/bin/ansysdis201 -dis -b -i inputfile -o outputfile

On Windows:
-host mach1 -np 2 "C:\Program Files\ANSYS Inc\V201\ANSYS\bin\Winx64\ANSYS.exe" -dis -b -i inputfile -o outputfile
-host mach2 -np 2 "C:\Program Files\ANSYS Inc\V201\ANSYS\bin\Winx64\ANSYS.exe" -dis -b -i inputfile -o outputfile


IBM MPI Appfile Examples


IBM MPI uses an appfile to define the machine(s) that will be used for the simulation. Typical IBM MPI
appfiles are shown below. Note that -mpi ibmmpi must be included as a command line argument.

On Linux:
-h mach1 -np 2 /ansys_inc/v201/ansys/bin/ansysdis201 -dis -mpi ibmmpi -b -i inputfile -o outputfile
-h mach2 -np 2 /ansys_inc/v201/ansys/bin/ansysdis201 -dis -mpi ibmmpi -b -i inputfile -o outputfile

On Windows:
-h mach1 -np 2 "C:\Program Files\ANSYS Inc\V201\ANSYS\bin\Winx64\ANSYS.exe" -dis -mpi ibmmpi -b -i inputfile -o outputfile
-h mach2 -np 2 "C:\Program Files\ANSYS Inc\V201\ANSYS\bin\Winx64\ANSYS.exe" -dis -mpi ibmmpi -b -i inputfile -o outputfile

4.2.6. Directory Structure Across Machines


Distributed ANSYS writes files to the master and slave machines (or compute nodes) as the analysis
progresses.

The working directory for each machine can be on a local drive or on a network shared drive. For
optimal performance, use local disk storage rather than network storage. Set up the same working
directory path structure on the master and slave machines.

When setting up your cluster environment, consider that Distributed ANSYS:

• cannot launch if identical working directory structures have not been set up on the master and all slave
machines.

• always uses the current working directory on the master machine and expects identical directory structures
to exist on all slave nodes. (If you are using the launcher, the working directory specified on the File
Management tab is the directory that Distributed ANSYS expects.)

4.3. Supported Analysis Types and Features


Distributed ANSYS does not support all analysis types, elements, solution options, etc. available in
shared-memory ANSYS. Distributed ANSYS may therefore:

• Stop an analysis to avoid performing an unsupported action, requiring you to launch shared-memory ANSYS
to perform the simulation.

• Disable the distributed-memory parallel processing capability and perform the operation using shared-
memory parallelism.

The disabling of distributed-memory parallel processing can occur at various levels in the program.
For more information, see Distributed ANSYS Behavior (p. 15).

4.3.1. Supported Analysis Types


This section lists analysis capabilities that are supported in Distributed ANSYS. This is not a
comprehensive list, but represents major features and capabilities found in the ANSYS program.


Most element types are valid in an analysis that uses distributed-memory parallel processing (including
but not limited to the elements mentioned below). For those element types not supported by
Distributed ANSYS, a restriction is included in the element description (see the Element Reference).
Check the assumptions/restrictions list for the element types you are using.

Supported Analysis Types

The following analysis types are supported and use distributed-memory parallelism throughout the
Distributed ANSYS solution. (That is, the solution runs in pure DMP mode (p. 16).)

• Linear static and nonlinear static analyses for single-field structural problems (DOFs: UX, UY, UZ, ROTX,
ROTY, ROTZ, WARP) and single-field thermal problems (DOF: TEMP).

• Buckling analyses using the Subspace or Block Lanczos eigensolver (BUCOPT,SUBSP; LANB).

• Modal analyses using the Subspace, PCG Lanczos, Block Lanczos, Unsymmetric, Damped, or QR damped
eigensolver (MODOPT,SUBSP; LANPCG; LANB; UNSYM; DAMP; or QRDAMP).

• Harmonic analyses, except when using the Variational Technology reuse method (HROPT,VTRU).

• Transient dynamic analyses.

• Substructuring analyses, including component mode synthesis (CMS) analyses.

• Spectrum analyses.

• Radiation analyses using the radiosity method.

• Low-frequency electromagnetic analyses.

• Coupled-field analyses.

• Superelements in the use pass of a substructuring analysis.

• Cyclic symmetry analyses (except mode-superposition harmonic).

The following analysis types are supported and use distributed-memory parallelism throughout the
Distributed ANSYS solution, except for the equation solver which uses shared-memory parallelism.
(In these cases, the solution runs in mixed mode (p. 16).)

• Static and full transient analyses (linear or nonlinear) that use the JCG or ICCG equation solvers. Note that
when the JCG equation solver is used in these analysis types, the JCG solver will actually run using
distributed-memory parallelism (that is, pure DMP mode) if the matrix is symmetric.

• Modal analyses using the Supernode eigensolver (MODOPT,SNODE).

• Full harmonic analyses using the JCG, ICCG, or QMR equation solvers.

4.3.2. Supported Features


This section lists features that are supported in Distributed ANSYS and features that are blocked. These
are not comprehensive lists, but represent major features and capabilities found in the ANSYS program.

Supported Features:


The following features are supported and use distributed-memory parallelism throughout the
Distributed ANSYS solution. (That is, the solution runs in pure DMP mode (p. 16).)

• Large deformations (NLGEOM,ON).

• Line search (LNSRCH,ON).

• Auto time stepping (AUTOTS,ON).

• Initial conditions (IC).

• Initial state (INISTATE).

• Nonlinear material properties specified by the TB command.

• Gasket elements and pre-tension elements.

• Lagrange multiplier based mixed u-P elements and TARGE169 - CONTA178.

• Contact nonlinearity (TARGE169 - CONTA178).

• User programmable features, including the user-defined element (USER300).

• Multi-frame restarts.

• Arc-length method (ARCLEN).

• Prestress effects (PSTRES).

• Inertia relief (IRLF,1), including the mass summary option (IRLF,-1).

• Multiple load steps and enforced motion in modal analyses (MODCONT).

• Residual vectors calculations (RESVEC).

The following feature is supported and uses distributed-memory parallelism throughout the Distributed
ANSYS solution, except for the remeshing procedure. (In this case, the solution runs in mixed
mode (p. 16).)

• Mesh nonlinear adaptivity (NLADAPTIVE).

The following analysis type is supported and uses distributed-memory parallelism throughout the
Distributed ANSYS solution, except for the 3-D model-creation procedure:

• 2-D to 3-D analysis (MAP2DTO3D)

The following features are supported, but do not use distributed-memory parallelism within Distributed
ANSYS. (The solution runs in pure SMP mode (p. 16).)

• Mode-superposition harmonic cyclic symmetry analysis.

• VCCT Crack Growth Simulation when fracture parameters in addition to CINT,VCCT are calculated on the
same geometric crack front. See VCCT Crack-Growth Simulation Assumptions in the Fracture Analysis Guide
for more information.

• MPC184 rigid link/beam element using the direct elimination method (KEYOPT(2) = 0).


Blocked Features:

The following feature is not supported by Distributed ANSYS:

• Element morphing.

4.4. Understanding the Working Principles and Behavior of Distributed ANSYS

The fundamental difference between Distributed ANSYS and shared-memory ANSYS is that N ANSYS
processes are running at the same time (where N is the total number of CPU cores used) for one model.
These N processes may be running on a single machine (in the same working directory) or on multiple
machines. These processes are not aware of each other's existence unless they are communicating
(sending messages). Distributed ANSYS, along with the MPI software, provides the means by which the
processes communicate with each other in the right location and at the appropriate time.

The following topics give a summary of behavioral differences between Distributed ANSYS and shared-
memory ANSYS in specific areas of the ANSYS program:
4.4.1. Differences in General Behavior
4.4.2. Differences in Solution Processing
4.4.3. Differences in Postprocessing
4.4.4. Restarts in Distributed ANSYS

4.4.1. Differences in General Behavior


File Handling Conventions: Upon startup and during a parallel solution, Distributed ANSYS appends
n to the current jobname (where n stands for the process rank). The master process rank is 0, and the
slave processes are numbered from 1 to N - 1. In the rare case that the user-supplied jobname ends in
a numeric value (0...9) or an underscore, an underscore is automatically appended to the jobname prior
to appending the process rank. This is done to avoid any potential conflicts with the new file names.

Therefore, upon startup and during a parallel solution, each process will create and use files named
Jobnamen.EXT. These files contain the local data that is specific to each Distributed ANSYS process.
Some common examples include the .LOG and .ERR files as well as most files created during solution
such as .ESAV, .FULL, .RST, and .MODE. See Program-Generated Files in the Basic Analysis Guide for
more information on files that ANSYS typically creates.

Actions that are performed only by the master process (/PREP7, /POST1, etc.) will work on the global
Jobname.EXT files by default. These files (such as Jobname.DB and Jobname.RST) contain the global
data for the entire model. For example, only the master process will save and resume the Jobname.DB
file.

After a parallel solution successfully completes, Distributed ANSYS automatically merges some of the
(local) Jobnamen.EXT files into a single (global) file named Jobname.EXT. These include files such as
Jobname.RST, Jobname.ESAV, Jobname.EMAT, and so on (see the DMPOPTION command for a complete
list of these files). This action is performed when the FINISH command is executed upon leaving the
solution processor. These files contain the same information about the final computed solution as files
generated for the same model computed with shared-memory ANSYS. Therefore, all downstream
operations (such as postprocessing) can be performed using shared-memory ANSYS (or in the same
manner as shared-memory ANSYS) by using these global files.

If any of these global Jobname.EXT files are not needed for downstream operations, you can reduce
the overall solution time by suppressing the file combination for individual file types (see the
DMPOPTION command for more information). If it is later determined that a global Jobname.EXT file
is needed for a subsequent operation or analysis, the local files can be combined by using the COMBINE
command.

DMPOPTION also has an option to combine the results file at certain time points during the distributed
solution. This enables you to postprocess the model while the solution is in progress, but it leads to
slower performance due to increased data communication and I/O.

Distributed ANSYS will not delete most files written by the slave processes when the analysis is
completed. If you choose, you can delete these files when your analysis is complete (including any
restarts that you may wish to perform). If you do not wish to have the files necessary for a restart saved,
you can issue RESCONTROL,NORESTART.

File copy, delete, and rename operations can be performed across all processes by using the DistKey
option on the /COPY, /DELETE, and /RENAME commands. This provides a convenient way to manage
local files created by a distributed parallel solution. For example, /DELETE,Fname,Ext,,ON automatically
appends the process rank number to the specified file name and deletes Fnamen.Ext from all processes.
See the /COPY, /DELETE, and /RENAME command descriptions for more information. In addition, the
/ASSIGN command can be used to control the name and location of specific local and global files
created by Distributed ANSYS.

Batch and Interactive Mode: You can launch Distributed ANSYS in either interactive or batch mode for
the master process. However, the slave processes are always in batch mode. The slave processes cannot
read the start.ans or stop.ans files. The master process sends all /CONFIG,LABEL commands to the slave
processes as needed.

On Windows systems, there is no ANSYS output console window when running the Distributed ANSYS
GUI (interactive mode). All standard output from the master process will be written to a file named
file0.out. (Note that the jobname is not used.)

Output Files: When a Distributed ANSYS job is executed, the output for the master process is written
in the same fashion as shared-memory ANSYS. By default, the output is written to the screen; or, if you
specified an output file via the launcher or the -o command line option, the output for the master
process is written to that file.

The slave processes automatically write their output to Jobnamen.OUT. Typically, these slave process
output files have little value because all of the relevant job information is written to the screen or the
master process output file. The exception is when the domain decomposition method is automatically
chosen to be FREQ (frequency-based for a harmonic analysis) or CYCHI (harmonic index-based for a
cyclic symmetry analysis); see the DDOPTION command for more details. In these cases, the solution
information for the harmonic frequencies or cyclic harmonic indices solved by the slave processes is
only written to the output files for those processes (Jobnamen.OUT).

Error Handling: The same principle also applies to the error file Jobnamen.ERR. When a warning or
error occurs on one of the slave processes during the Distributed ANSYS solution, the process writes
that warning or error message to its error file and then communicates the warning or error message
to the master process. Typically, this allows the master process to write the warning or error message
to its error file and output file and, in the case of an error message, allows all of the Distributed ANSYS
processes to exit the program simultaneously.

In some cases, an error message may fail to be fully communicated to the master process. If this
happens, you can view each Jobnamen.ERR and/or Jobnamen.OUT file in an attempt to learn why the
job failed. The error files and output files written by all the processes will be incomplete but may still
provide some useful information as to why the job failed.

In some rare cases, the job may hang. When this happens, you can use the cleanup script (or .bat file
on Windows) to kill the processes. The cleanup script is automatically written into the initial working
directory of the master process and is named cleanup-ansys-[machineName]-[processID].sh (or .bat).

Use of APDL: In pre- and postprocessing, APDL works the same in Distributed ANSYS as in
shared-memory ANSYS. However, in the solution processor (/SOLU), Distributed ANSYS does not support
certain *GET items. In general, Distributed ANSYS supports global solution *GET results such as total
displacements and reaction forces. It does not support element-level results specified by ESEL, ESOL,
and ETABLE labels. Unsupported items will return a *GET value of zero.
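
For instance, a global retrieval of this kind is the type of item that remains available during a distributed
solution, while element-level items do not (a sketch; the node number and parameter name are arbitrary):

*GET,ux_n10,NODE,10,U,X    ! nodal DOF result retrieved inside /SOLU
! ESOL/ETABLE-style element items are not supported here and return zero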


Condensed Data Input: Multiple commands entered in an input file can be condensed into a single
line if the commands are separated by the $ character (see Condensed Data Input in the Command
Reference). Distributed ANSYS cannot properly handle condensed data input. Each command must be
placed on its own line in the input file for a Distributed ANSYS run.

4.4.2. Differences in Solution Processing


Domain Decomposition (DDOPTION Command): Upon starting a solution, Distributed ANSYS
automatically decomposes the problem into N CPU domains so that each process (or each CPU core)
works on only a portion of the simulation. This domain decomposition is done only by the master
process. Typically, the optimal domain decomposition method is automatically chosen based on a
variety of factors (analysis type, number of CPU cores, available RAM, and so on). However, you can
control which domain decomposition method is used by setting the Decomp argument on the
DDOPTION command.

For most simulations, the program automatically chooses the mesh-based domain decomposition
method (Decomp = MESH), which means each CPU domain is a group or subset of elements within
the whole model. For certain harmonic analyses, the domain decomposition may be based on the
frequency domain (Decomp = FREQ), in which case each CPU domain computes the harmonic solution
for the entire model at a different frequency point. For certain cyclic symmetry analyses, the domain
decomposition may be based on the harmonic indices (Decomp = CYCHI), in which case each CPU
domain computes the cyclic solution for a different harmonic index.

The NPROCPERSOL argument on the DDOPTION command gives you the flexibility to combine the
FREQ or CYCHI decomposition methods with the mesh-based domain decomposition method. Consider
a harmonic analysis with 50 frequency points requested (NSUBST,50) run on a workstation using
Distributed ANSYS with 16 cores (-dis -np 16). Using mesh decomposition (DDOPTION,MESH) essentially
solves one frequency at a time with 16 groups of elements (1x16). Using DDOPTION,FREQ,1 solves 16
frequencies at a time with 1 group of elements; that is, the entire FEA model (16x1). Using the
NPROCPERSOL field allows you to consider alternative combinations in between these two scenarios.
You could try DDOPTION,FREQ,2 to solve 8 frequencies at a time with 2 groups of elements per solution
(8x2), or DDOPTION,FREQ,4 to solve 4 frequencies with 4 groups of elements (4x4), and so on. Note
that the total core count specified at startup cannot be altered, and the program works to maintain
the NPROCPERSOL value as input, which means the number of frequency or cyclic harmonic index
solutions solved at a time may need to be adjusted to fit within the other defined parameters.
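
As a rough sketch, the 16-core scenario above might be set up in the solution phase as follows (the
frequency range shown is illustrative only, and the -dis -np 16 options are given on the command
line as usual):

/SOLU
ANTYPE,HARMIC          ! full harmonic analysis
HARFRQ,0,500           ! illustrative frequency range
NSUBST,50              ! 50 frequency points
DDOPTION,FREQ,4        ! frequency decomposition: 4 cores per frequency, 4 frequencies at a time
SOLVE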

In the case of a linear perturbation harmonic analysis, if the decomposition is based on the frequency
domain (Decomp = FREQ or automatically chosen when Decomp = AUTO), use the NUMLPFREQS
argument to specify the number of frequency solutions in the subsequent harmonic analysis. This
ensures the best performance for this type of analysis.

There are pros and cons for each domain decomposition approach. For example, with the default MESH
method, the total amount of memory required to solve the simulation with N processes is typically not
much larger than the amount of memory required to solve the simulation on one process. However,
when using the FREQ or CYCHI domain decomposition method, the amount of memory required to
solve the simulation on N processes is typically N times greater than the amount of memory needed
to solve the simulation on one process. As another example, with the MESH domain decomposition
method, the application requires a significant amount of data communication via MPI, thus requiring
a fast interconnect for optimal performance. With the FREQ and CYCHI methods, very little data
communication is necessary between the processes; therefore, good performance can still be achieved
using slower interconnect hardware.

All FEA data (elements, nodes, materials, sections, real constants, boundary conditions, etc.) required
to compute the solution for each CPU domain is communicated to the slave processes by the master
process. Throughout the solution, each process works only on its piece of the entire model. When the
solution phase ends (for example, FINISH is issued in the solution processor), the master process in
Distributed ANSYS works on the entire model again (that is, it behaves like shared-memory ANSYS).

Print Output (OUTPR Command): In Distributed ANSYS, the OUTPR command prints NSOL and RSOL
in the same manner as in shared-memory ANSYS. However, for other items such as ESOL, Distributed
ANSYS prints only the element solution for the group of elements belonging to the CPU domain of the
master process. Therefore, OUTPR,ESOL has incomplete information and is not recommended. Also,
the order of elements differs from that of shared-memory ANSYS due to domain decomposition, so a
direct one-to-one element comparison with shared-memory ANSYS output will show differences when
using OUTPR.

Large Number of CE/CP and Contact Elements: Both shared-memory ANSYS and Distributed ANSYS
can handle a large number of coupling and constraint equations (CE/CP) and contact elements. However,
specifying too many of these items forces Distributed ANSYS to communicate more data among the
processes, resulting in a longer elapsed time to complete a distributed parallel job. You should reduce
the number of CE/CP if possible and limit potential contact pairs to a smaller region to avoid degraded
performance. In addition, for assembly contact pairs or small-sliding contact pairs, you can use the
command CNCHECK,TRIM to remove contact and target elements that are initially in the far field (open
and not near contact). This trimming option helps achieve better performance in Distributed ANSYS
runs.


4.4.3. Differences in Postprocessing


Postprocessing with Database File and SET Commands: Shared-memory ANSYS can postprocess the
last set of results using the Jobname.DB file (if the solution results were saved), as well as using the
Jobname.RST file. Distributed ANSYS, however, can only postprocess using the Jobname.RST file and
cannot use the Jobname.DB file, as solution results are not entirely written to the database. You will
need to issue a SET command before postprocessing.

Postprocessing with Multiple Results Files: By default, Distributed ANSYS will automatically combine
the local results files (for example, Jobnamen.RST) into a single global results file (Jobname.RST). This
step can be expensive depending on the number of load steps and the amount of results stored for
each solution. It requires each local results file to be read by each slave process, communicated to the
master process, and then combined together and written by the master process. As a means to reduce
the amount of communication and I/O performed by this operation, the DMPOPTION command can
be used to skip the step of combining the local results files into a single global results file. Then the
RESCOMBINE command macro can be used in /POST1 to individually read each local results file until
the entire set of results is placed into the database for postprocessing. If needed, a subsequent RESWRITE
command can then be issued to write a global results file for the distributed solution.

Note that if the step of combining the results files is skipped, it may affect downstream analyses that
rely on a single global results file for the entire model. If it is later determined that a global results file
(e.g., Jobname.RST) is needed for a subsequent operation, you can use the COMBINE command to
combine the local results files into a single global results file.
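
A minimal sketch of this workflow, assuming the results-file combination is deferred in the distributed
run (all other solution settings are omitted):

! in the solution phase of the distributed run
DMPOPTION,RST,NO      ! do not combine the local results files automatically
SOLVE
FINISH

! later, if a single global results file turns out to be needed
COMBINE,RST           ! merge the local Jobnamen.RST files into Jobname.RST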

4.4.4. Restarts in Distributed ANSYS


Distributed ANSYS supports multiframe restarts for nonlinear static, full transient, and mode-superposition
transient analyses. The procedures and command controls are the same as described in the Basic
Analysis Guide. However, restarts in Distributed ANSYS have additional limitations based on the
procedure used, as described in the following sections:

• Procedure 1: Use the Same Number of Cores (p. 35)

• Procedure 2: Use a Different Number of Cores (p. 37)

See also Additional Considerations for the Restart (p. 38) for more information on restarting a
distributed-memory parallel solution.

4.4.4.1. Procedure 1 - Use the Same Number of Cores


This procedure requires that you use the same number of cores for the restart as in the original
run. It does not require any additional commands beyond those used in a typical multiframe restart
procedure.

• The total number of cores used when restarting Distributed ANSYS must not be altered following the
first load step and first substep.


Example 1: If you use the following command line for the first load step:
ansys201 -dis -np 8 -i input -o output1

then you must also use 8 cores (-dis -np 8) for the multiframe restart, and the files from the
original analysis that are required for the restart must be located in the current working directory.
Additionally, this means you cannot perform a restart using shared-memory parallel processing
(shared-memory ANSYS) if Distributed ANSYS was used prior to the restart point.

• When running across machines, the job launch procedure (or script) used when restarting Distributed
ANSYS must not be altered following the first load step and first substep. In other words, you must use
the same number of machines, the same number of cores for each of the machines, and the same host
(master) and slave relationships among these machines in the restart job that follows.

Example 2: If you use the following command line for the first load step where the host machine
(which always appears first in the list of machines) is mach1, and the slave machines are mach2
and mach3:
ansys201 -dis -machines mach1:4:mach2:1:mach3:2 -i input -o output1

then for the multiframe restart, you must use a command line such as this:
ansys201 -dis -machines mach7:4:mach6:1:mach5:2 -i restartjob -o output2

This command line uses the same number of machines (3), the same number of cores for each
machine in the list (4:1:2), and the same host/slave relationship (4 cores on host, 1 core on first
slave, and 2 cores on second slave) as the original run. Any alterations in the -machines field,
other than the actual machine names, will result in restart failure. Finally, the files from the ori-
ginal analysis that are required for the restart must be located in the current working directory
on each of the machines.

• The files needed for a restart must be available on the machine(s) used for the restarted analysis. Each
machine has its own restart files that are written from the previous run. The restart process needs to
use these files to perform the correct restart actions.

For Example 1 above, if the two analyses (-dis -np 8) are performed in the same working
directory, no action is required; the restart files will already be available. However, if the restart
is performed in a new directory, all of the restart files listed in Table 4.3: Required Files for
Multiframe Restart - Procedure 1 (p. 36) must be copied (or moved) into the new directory before
performing the multiframe restart.

For Example 2 above, the restart files listed in the “Host Machine” column in Table 4.3: Required
Files for Multiframe Restart - Procedure 1 (p. 36) must be copied (or moved) from mach1 to
mach7, and all of the files in the “Slave Machines” column must be copied (or moved) from
mach2 to mach6 and from mach3 to mach5 before performing the multiframe restart.

Table 4.3: Required Files for Multiframe Restart - Procedure 1

Host Machine                                            Slave Machines

Jobname.LDHI                                            --
Jobname.RDB                                             --
Jobname.RDnn if remeshed due to nonlinear               --
adaptivity, where nn is the number of
remeshings before the restart
Jobname0.Xnnn [1]                                       Jobnamen.Xnnn (where n is the process rank
                                                        and nnn is a restart file identifier)
Jobname0.RST (this is the local .RST file for this      Jobnamen.RST (this is the local .RST file
domain) [2]                                             for this domain) [2]

1. The .Xnnn file extension mentioned here refers to the .Rnnn and .Mnnn files discussed in Multiframe
File Restart Requirements in the Basic Analysis Guide.

2. The Jobnamen.RST files are optional. The restart can be performed successfully without them.

4.4.4.2. Procedure 2 - Use a Different Number of Cores


In this procedure, the total number of cores used when restarting Distributed ANSYS can be altered
following the first load step and first substep. In addition, you can perform the restart using either
a distributed-memory parallel solution (Distributed ANSYS) or a shared-memory parallel solution
(shared-memory ANSYS). Some additional steps beyond the typical multiframe restart procedure
are required to ensure that necessary files are available.

Note:

This procedure is not available when performing a restart for the following analysis types:
mode-superposition transient analysis, an analysis that includes mesh nonlinear adaptivity,
and 2-D to 3-D analysis.

• In this procedure, the Jobname.Rnnn file must be available. This file is not generated by default in
Distributed ANSYS, even when restart controls are activated via the RESCONTROL command. You must
either use the command DMPOPTION,RNN,YES in the prior (base) analysis, or you must manually
combine the Jobnamen.Rnnn files into the Jobname.Rnnn file using the COMBINE command (see the sketch following this list).

• For example, if you use the following command line for the first load step:
ansys201 -dis -np 8 -i input -o output1

then for the multiframe restart, you can use more or less than 8 cores (-np N, where N does
not equal 8).

• The files from the original analysis that are required for the restart must be located in the current
working directory. If running across machines, the restart files are only required to be in the current
working directory on the host machine.
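
For example (a sketch combining the points above; jobnames, file names, and core counts are
illustrative), the base run could request the combined restart file from within its input:

DMPOPTION,RNN,YES     ! write a combined Jobname.Rnnn restart file for a later restart
SOLVE

and be launched on 8 cores, while the restart is launched with a different core count:

ansys201 -dis -np 8 -i input -o output1        (base run)
ansys201 -dis -np 4 -i restartjob -o output2   (restart using 4 cores)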

Table 4.4: Required Files for Multiframe Restart - Procedure 2

Host Machine                                                                Slave Machines

Jobname.LDHI                                                                --
Jobname.RDB                                                                 --
Jobname.RDnn if remeshed due to nonlinear adaptivity, where nn is the       --
number of remeshings before the restart
Jobname.Rnnn                                                                --
Jobname.RST [1]                                                             --

1. The Jobname.RST file is optional. The restart can be performed successfully without it.

4.4.4.3. Additional Considerations for the Restart


The advantage of using Procedure 1 (same number of cores) is faster performance, achieved by avoiding
the potentially costly steps of combining the Jobnamen.Rnnn files and later splitting the Jobnamen.Rnnn
(and possibly the Jobname.RST/RTH/RMG) files during the restarted analysis. The disadvantage is that
there are more files to manage during the restart process.

The advantage of Procedure 2 (different number of cores) is fewer files to manage during the restart
process, but at the cost of additional time to combine the local restart files and then, later on, to
split this file data during the restarted analysis. Depending on the size of the simulation, this overhead
may be insignificant.

In all restarts, the Jobname.RST results file (or Jobname.RTH or Jobname.RMG) on the host
machine is recreated after each solution by merging the Jobnamen.RST files again.

If you do not require a restart, issue RESCONTROL,NORESTART in the run to remove or to avoid
writing the necessary restart files on the host and slave machines. If you use this command, the
slave processes will not have files such as .ESAV, .OSAV, .RST, or .X000 in the working directory
at the end of the run. In addition, the host process will not have files such as .ESAV, .OSAV,
.X000, .RDB, or .LDHI at the end of the run. The program will remove all of the above scratch
files at the end of the solution phase (FINISH or /EXIT). This option is useful for file cleanup and
control.
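
For example, a minimal sketch of a solution sequence that suppresses the restart files (the model, load, and boundary-condition commands are omitted here) might look like this:

/SOLU
RESCONTROL,NORESTART    ! do not write (and clean up) multiframe restart files
SOLVE
FINISH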

4.5. Example Problems


This section contains tutorials for running Distributed ANSYS on Linux and Windows platforms. To
download all three files for these examples, in .zip format, click here.

This tutorial is divided into two parts.

The first part walks you through setting up your distributed environment and running a test to verify
that communication between systems is occurring correctly. Running this part is optional (though
strongly recommended), and need only be done when you initially configure your environment, or
when you change your environment by adding, removing, or changing machines.

The second part of the tutorial walks you through setting up your distributed environment and then
directly running a sample problem. The tutorial is designed so that you can modify settings (such as
the number of machines in the cluster, number of cores on each machine, etc.), but we strongly recom-
mend that the first time you run this tutorial, you follow the steps exactly. Once you're familiar with
the process, you can then modify the tutorial to more closely match your particular environment. You
can then use the files generated in future GUI or batch runs.


Both static and modal examples are provided. The problem setup for each platform is the same
whether you run the static or the modal example.

4.5.1. Example: Running Distributed ANSYS on Linux


The following tutorial walks you through the setup of your Distributed ANSYS environment, and is
applicable only to systems running ANSYS 2020 R1 on a 64-bit Linux cluster under IBM MPI 9.1.4.3.
The ANSYS 2020 R1 installation includes IBM MPI 9.1.4.3.

One of these sample problems, tutor1_carrier_linux.inp (static) or tutor2_carrier_modal.inp
(modal), is required to complete the tutorial. Once you've downloaded the files, save them
to your working directory before beginning the tutorial. You can run either sample problem using
the problem setup described here.

Part A: Setup and Run mpitest

1. Set up identical installation and working directory structures on all machines (master and slaves) in the
cluster.

2. Type hostname on each machine in the cluster. Note the name of each machine. You will need this
name to set up the .rhosts file (for rsh protocol), and to specify host names in the Mechanical APDL
Product Launcher.

3. Set up the .rhosts file on each machine. The .rhosts file lists each machine in the cluster,
followed by your username. The machines should be listed using their complete system name,
as taken from uname. For example, each .rhosts file for our two-machine cluster looks like
this (where golinux1 and golinux2 are example machine names, and jqd is an example username):
golinux1 jqd
golinux2 jqd

4. Change/verify .rhosts file permissions on all machines by issuing:


chmod 600 .rhosts

5. Navigate to your working directory. Run the following:


/ansys_inc/v201/ansys/bin/mpitest201

The mpitest program should start without errors. If it does not, check your paths, .rhosts file, and
permissions; correct any errors; and rerun.

Part B: Setup and Run a Distributed Solution

1. Set up identical installation and working directory structures on all machines (master and slaves) in the
cluster.

2. Install ANSYS 2020 R1 on the master machine, following the typical installation process.

3. Install ANSYS 2020 R1 on the slave machines.

Steps 2 and 3 above will install all necessary components on your machines, including IBM MPI
9.1.4.3.


4. Type hostname on each machine in the cluster. Note the name of each machine. You will need this
name to set up the .rhosts file.

5. Set up the .rhosts file on each machine. The .rhosts file lists each machine in the cluster, followed
by your username. The machines should be listed using their complete system name, as taken from
uname. For example, each .rhosts file for our two-machine cluster looks like this (where golinux1
and golinux2 are example machine names, and jqd is an example username):
golinux1 jqd
golinux2 jqd

6. Change/verify .rhosts file permissions on all machines by issuing:


chmod 600 .rhosts

7. Verify communication between machines via rsh. If communication between the machines is
configured correctly, you will not be prompted for a password.

8. Start ANSYS using the launcher:


launcher201

9. Select the correct environment and license.

10. Go to the High Performance Computing Setup tab. Select Use Distributed Computing (MPP). You
must also specify either local machine or multiple hosts. For multiple hosts, use the New Host button
to add machines to the Selected Hosts list.

If necessary, you can also run secure shell (ssh) by selecting Use Secure Shell instead of Remote
Shell (ssh instead of rsh).

11. Click Run to launch ANSYS.

12. In ANSYS, select File>Read Input From and navigate to tutor1_carrier_linux.inp or
tutor2_carrier_modal.inp.

13. The example will progress through the building, loading, and meshing of the model. When it stops,
select Main Menu>Solution>Analysis Type>Sol'n Controls.

14. On the Solution Controls dialog box, click on the Sol'n Options tab.

15. Select the Pre-Condition CG solver.

16. Click OK on the Solution Controls dialog box.

17. Solve the analysis. Choose Main Menu>Solution>Solve>Current LS. Click OK.

18. When the solution is complete, you can postprocess your results as you would with any analysis. For
example, you could select Main Menu>General Postproc>Read Results>First Set and select the desired
result item to display.
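
Once you are comfortable with the launcher-based procedure above, the same example can also be run in batch mode from the command line. The following is only a sketch: the host names (golinux1, golinux2), core counts, and output file name are placeholders, and it assumes the -machines option (host:cores pairs separated by colons) is used to list the cluster hosts.

ansys201 -dis -b -machines golinux1:2:golinux2:2 -i tutor1_carrier_linux.inp -o tutor1.out

Here -b requests batch mode, -dis requests a distributed-memory parallel solution, and -machines lists each host together with the number of cores to use on it.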

4.5.2. Example: Running Distributed ANSYS on Windows


The following tutorial walks you through the setup of your Distributed ANSYS environment, and is
applicable only to systems running ANSYS 2020 R1 on a Windows cluster under IBM MPI 9.1.4.5.


One of these sample problems, tutor1_carrier_win.inp (static) or tutor2_carrier_modal.inp
(modal), is required to complete the tutorial. Once you've downloaded the files, save them
to your working directory before beginning the tutorial. You can run either sample problem using
the problem setup described here.

1. Set up identical installation and working directory structures on all machines (master and slaves) in the
cluster.

2. Install ANSYS 2020 R1 on the master machine, following the typical installation process.

3. Configure ANSYS 2020 R1 on the slave machines.

4. Install and register IBM MPI 9.1.4.5 on both machines following the instructions in the beginning of this
chapter.

5. Add %MPI_ROOT%\bin to the PATH environment variable on both machines (assuming IBM MPI
9.1.4.5 was installed on the C:\ drive). This directory must be in your path for the mpirun command
to be recognized. (A sketch of one way to do this appears after these steps.)

6. On each machine, right-click on My Computer, left-click on Properties, and select the Network Iden-
tification or Computer Name tab. The full computer name will be listed. Note the name of each machine
(not including the domain). You will need this name to set up the selected hosts in the Mechanical
APDL Product Launcher.

7. Start ANSYS using the launcher: Start >Programs >ANSYS 2020 R1 > Mechanical APDL Product
Launcher 2020 R1.

8. Select ANSYS Batch as the Simulation Environment, and choose a license. Specify
tutor1_carrier_win.inp or tutor2_carrier_modal.inp as your input file. Both of these examples use the
PCG solver. You must set your working directory to the location of this file.

9. Go to the High Performance Computing Setup tab. Select Use Distributed Computing (MPP). You
must specify either local machine or multiple hosts. For multiple hosts, use the New Host button to
add machines to the Selected Hosts list.

10. Click Run.

11. When the solution is complete, you can postprocess your results as you would with any analysis.
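
As a minimal sketch of step 5 (the exact path depends on where IBM MPI was installed), you can prepend the MPI bin directory to the PATH for the current Command Prompt session and confirm that mpirun can be found; for a permanent change, edit the system PATH through the Environment Variables dialog instead:

rem Prepend the IBM MPI bin directory to PATH for this session only
set PATH=%MPI_ROOT%\bin;%PATH%

rem Confirm that the mpirun command can now be located
where mpirun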

4.6. Troubleshooting
This section describes problems that you may encounter while using Distributed ANSYS, as well as
methods for overcoming them. Some of these problems are specific to a particular system,
as noted.

4.6.1. Setup and Launch Issues


To aid in troubleshooting, you may need to view the actual MPI run command line. On Linux the
command is mpirun, and you can view the command line by setting the ANS_SEE_RUN_COMMAND
environment variable to 1. On Windows the command is mpiexec, and you can view the command
line by setting the ANS_SEE_RUN and ANS_CMD_NODIAG environment variables to TRUE.
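
For example (a minimal sketch; adjust for your shell and environment):

# Linux (bash): echo the mpirun command line used by Distributed ANSYS
export ANS_SEE_RUN_COMMAND=1

rem Windows (Command Prompt): echo the mpiexec command line
SET ANS_SEE_RUN=TRUE
SET ANS_CMD_NODIAG=TRUE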

Job fails to launch


The first thing to check when a Distributed ANSYS job fails to launch is that the MPI software you
wish to use is installed and properly configured (see Configuring Distributed ANSYS (p. 17)).

Next, if running across multiple machines, ensure that the working directory path is identical on
all machines (or that you are using a shared network directory) and that you have permission to
write files into the working directory used by each machine.

Finally, make sure that you are running the distributed solution on a homogeneous cluster. The
OS level and processors must be identical on all nodes in the cluster. If they are not, you may
encounter problems. For example, when running Distributed ANSYS across machines using Intel
MPI, if the involved cluster nodes have different processor models, the program may hang (i.e.,
fail to launch). In this situation, no data is written to any files and no error message is output.

When using Distributed ANSYS with Intel MPI on a Windows cluster, the program may hang when
launching. One workaround is to switch to IBM MPI. Another workaround is to run on a Windows
HPC server cluster which uses Microsoft MPI.

No permission to system

This error can occur if you do not have login access to a remote system where Distributed ANSYS
is supposed to run. If you use Linux, you can experience this problem if you have not set permis-
sions on the .rhosts file properly. Before starting Distributed ANSYS, be sure you have access
to all the remote systems you have specified (i.e., you should be able to rsh to those systems)
and that the control node is included in its own .rhosts file. If you run on Linux, be sure you
have run the following command:
chmod 600 .rhosts
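
For example, using the hypothetical host names from the tutorial earlier in this chapter, you can confirm passwordless rsh access from the control node with a command such as:

rsh golinux2 hostname

If the remote host name is printed without a password prompt, rsh access is configured correctly.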

MPI: could not run executable

If you encounter this message, verify that you have the correct version of MPI and that it is installed
correctly, and verify that you have a .rhosts file on each machine. If not, create a .rhosts
file on all machines where you will run Distributed ANSYS, make sure the permissions on the file
are 600, and include an entry for each hostname where you will run Distributed ANSYS.

Error executing ANSYS. Refer to System-related Error Messages in the ANSYS online help. If
this was a Distributed ANSYS job, verify that your MPI software is installed correctly, check
your environment settings, or check for an invalid command line option.

You may encounter the above message when setting up IBM MPI or running Distributed ANSYS
using IBM MPI on a Windows platform. This may occur if you did not correctly run the set password
bat file. Verify that you completed this item according to the IBM MPI installation instructions.

You may also see this error if Ansys Inc\v201\ansys\bin\<platform> (where <platform>
is intel or winx64) is not in your PATH.

If you need more detailed debugging information, use the following:

1. Open a Command Prompt window and set the following:


SET ANS_SEE_RUN=TRUE
SET ANS_CMD_NODIAG=TRUE


2. Run the following command line: ansys201 -b -dis -i myinput.inp -o myoutput.out.

Distributed ANSYS fails to launch when running from a fully-qualified pathname.

Distributed ANSYS will fail if the ANSYS 2020 R1 installation path contains a space followed by
a dash and %ANSYS201_DIR%\bin\<platform> (where <platform> is intel or winx64) is not
in the system PATH. Add %ANSYS201_DIR%\bin\<platform> to the system PATH and invoke
ansys201 (without the fully qualified pathname). For example, if your installation path is:
C:\Program Files\Ansys -Inc\v201\bin\<platform>

The following command to launch Distributed ANSYS will fail:


"C:\Program Files\Ansys -Inc\v201\bin\<platform>\ansys201.exe” -g

However, if you add C:\Program Files\Ansys -Inc\v201\bin\<platform> to the
system PATH, you can successfully launch Distributed ANSYS by using the following command:
ansys201 -g

The required licmsgs.dat file, which contains licensing-related messages, was not found or could
not be opened. The following path was determined using environment variable ANSYS201_DIR.
This is a fatal error - - exiting.

Check the ANSYS201_DIR environment variable to make sure it is set properly. Note that for
Windows HPC clusters, the ANSYS201_DIR environment variable should be set to
\\HEADNODE\Ansys Inc\v201\ansys, and the ANSYSLIC_DIR environment variable should be set to
\\HEADNODE\Ansys Inc\Shared Files\Licensing on all nodes.
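
For example, on a Windows node you can check (and, if necessary, set for the current session) these variables from a Command Prompt; the head node name below is a placeholder:

rem Check the current values
echo %ANSYS201_DIR%
echo %ANSYSLIC_DIR%

rem Set them for this session on a Windows HPC cluster (placeholder head node name)
set "ANSYS201_DIR=\\HEADNODE\Ansys Inc\v201\ansys"
set "ANSYSLIC_DIR=\\HEADNODE\Ansys Inc\Shared Files\Licensing"

For a permanent change, set the variables through the System Properties > Environment Variables dialog on each node.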

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the
current machine only. (Intel MPI)

When using Intel MPI and running across multiple machines (boxes), you must have an mpd.hosts
file in your working directory that contains a line for each machine. Otherwise, you will encounter
the following error:

WARNING: Unable to read mpd.hosts or list of hosts isn't provided. MPI job will be run on the
current machine only.
mpiexec: unable to start all procs; may have invalid machine names
remaining specified hosts:
xx.x.xx.xx (hostname)
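
For example, a minimal mpd.hosts file for the hypothetical two-machine cluster used in the tutorial earlier in this chapter contains one machine name per line:

golinux1
golinux2

Place this file in the working directory from which Distributed ANSYS is launched.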

4.6.2. Stability Issues


This section describes potential stability issues that you may encounter while running Distributed
ANSYS.

Intel MPI May Crash on AMD Processors


When using Distributed ANSYS with Intel MPI on systems having AMD processors, the program may
become unstable and crash. The workaround is to use IBM MPI instead of Intel MPI.

Recovering from a Computer, Network, or Program Crash


When a Distributed ANSYS job crashes unexpectedly (e.g., segmentation violation, floating point
exception, out-of-disk-space error), an error message may fail to be fully communicated to the
master process and written into the output file. If this happens, you can view all of the output
and/or error files written by each of the slave processes (e.g., Jobnamen.OUT and/or Jobnamen.ERR)
in an attempt to learn why the job failed. In some rare cases, the job may hang. When this happens,
you must manually kill the processes; the error files and output files written by all the processes
will be incomplete but may still provide some useful information as to why the job failed.

Be sure to kill any lingering processes (Linux: use the kill -9 command; Windows: use Task
Manager) on all machines involved in the run, and then start the job again.
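
For example, on Linux a manual cleanup might look like the following (the grep pattern and process ID are placeholders for your actual job):

ps -ef | grep ansys201     # list any lingering Distributed ANSYS processes
kill -9 <PID>              # terminate each lingering process by its ID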

Job Fails with SIGTERM Signal (Linux Only)


Occasionally, when running on Linux, a simulation may fail with a message like the following:

MPI Application rank 2 killed before MPI_Finalize() with signal 15

forrtl: error (78): process killed (SIGTERM)

This typically occurs while computing the solution and means that the operating system has killed
the ANSYS process. The two most common causes are: (1) ANSYS was using too much of the
machine's resources (typically memory) and the system terminated it, or (2) a user manually killed
the ANSYS job (e.g., with the kill -9 system command). Check the size of the job you are running
relative to the amount of physical memory on the machine. Most often, decreasing the model size or
moving to a machine with more RAM will result in a successful run.
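
For example, on Linux you can compare the memory required by the job against the installed physical memory with a standard command such as:

free -g     # report total and available physical memory in gigabytes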

4.6.3. Solution and Performance Issues


This section describes solution and performance issues that you may encounter while running Distrib-
uted ANSYS.

Poor Speedup or No Speedup


As more cores are utilized, the runtimes are generally expected to decrease. The biggest relative gains
are typically achieved when using two cores compared to using a single core. When significant speedups
are not seen as additional cores are used, the reasons may involve both hardware and software issues.
These include, but are not limited to, the following situations.

Hardware
Oversubscribing hardware: In a multiuser environment, this could mean that more physical cores
are being used by multiple simulations than are available on the machine. It could also mean that
hyperthreading is activated. Hyperthreading typically involves enabling extra virtual cores, which
can sometimes allow software programs to more effectively use the full processing power of the
CPU. However, for compute-intensive programs such as ANSYS, using these virtual cores rarely
provides a significant reduction in runtime. Therefore, it is recommended you do not use hyperthread-
ing; if hyperthreading is enabled, it is recommended you do not exceed the number of physical
cores.

Lack of memory bandwidth: On some systems, using most or all of the available cores can
result in a lack of memory bandwidth. This lack of memory bandwidth can affect the overall
scalability.

Slow interconnect speed: When running Distributed ANSYS across multiple machines, the
speed of the interconnect (GigE, Infiniband, etc.) can have a significant effect on the perform-
ance. Slower interconnects cause each Distributed ANSYS process to spend extra time waiting
for data to be transferred from one machine to another. This becomes especially important
as more machines are involved in the simulation. See Interconnect Configuration at the begin-
ning of this chapter for the recommended interconnect speed.


Software
Simulation includes non-supported features: The shared and distributed-memory parallelisms
work to speed up certain compute-intensive operations in /PREP7, /SOLU and /POST1. However,
not all operations are parallelized. If a particular operation that is not parallelized dominates the
simulation time, then using additional cores will not help achieve a faster runtime.

Simulation has too few DOF (degrees of freedom): Some analyses (such as transient
analyses) may require long compute times, not because the number of DOF is large, but be-
cause a large number of calculations are performed (i.e., a very large number of time steps).
Generally, if the number of DOF is relatively small, parallel processing will not significantly
decrease the solution time. Consequently, for small models with many time steps, parallel
performance may be poor because the model size is too small to fully utilize a large number
of cores.

I/O cost dominates solution time: For some simulations, the amount of memory required
to obtain a solution is greater than the physical memory (i.e., RAM) available on the machine.
In these cases, either virtual memory (i.e., hard disk space) is used by the operating system
to hold the data that would otherwise be stored in memory, or the equation solver writes
extra files to the disk to store data. In both cases, the extra I/O done using the hard drive can
significantly affect performance, making the I/O performance the main bottleneck to achieving
optimal performance. In these cases, using additional cores will typically not result in a signi-
ficant reduction in overall time to solution.

Large contact pairs: For simulations involving contact pairs with a large number of elements
relative to the total number of elements in the entire model, the performance of Distributed
ANSYS is often negatively affected. These large contact pairs require Distributed ANSYS to do
extra communication and often cause a load imbalance between each of the cores (i.e., one
core might have two times more computations to perform than another core). In some cases,
using CNCHECK,TRIM can help trim any unnecessary contact/target elements from the larger
contact pairs. In other cases, however, manual interaction will be required to reduce the
number of elements involved in the larger contact pairs.
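
As a minimal sketch (the surrounding model, contact, and load commands are omitted), trimming the contact pairs before solving might look like this:

/SOLU
CNCHECK,TRIM     ! trim unnecessary contact/target elements from large contact pairs
SOLVE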

Different Results Relative to a Single Core


Distributed-memory parallel processing initially decomposes the model into domains. Typically, the
number of domains matches the number of cores. Operational randomness and numerical round-off
inherent to parallelism can cause slightly different results between runs on the same machine(s) using
the same number of cores or different numbers of cores. This difference is often negligible. However, in
some cases the difference is appreciable. This sort of behavior is most commonly seen on nonlinear
static or transient analyses which are numerically unstable. The more numerically unstable the model
is, the more likely the convergence pattern or final results will differ as the number of cores used in the
simulation is changed.
