Top Tips for Debugging and Optimizing NVIDIA Networking Performance

In today's high-speed networking world, optimizing and troubleshooting performance is crucial, especially with high-performance equipment such as NVIDIA InfiniBand switches. Whether you are a data center administrator or a network engineer, mastering effective techniques is key. In this blog, we share top tips for debugging and optimizing NVIDIA InfiniBand networking performance and show how to fine-tune your InfiniBand fabric for peak efficiency, ensuring flawless performance under heavy load.


System configuration

CPU/BIOS-General debugging parameters

  • NUMA region - NPS1/NPS2 on AMD CPUs

The NUMA distance between the CPU cores and the IB device causes significant latency differences, and NUMA partitioning affects memory bandwidth.

  • High-priority I/O - NIC B:D.F

Set the target IB device (identified by its PCIe B:D.F address) as the I/O device the system schedules with the highest priority.

  • Performance Mode / P-state / C-state - Enable / Disable / Disable

Energy-saving modes add CPU wake-up latency.

Run single-core, single-thread workloads to reduce thread-switching overhead.

Optimize I/O mapping and disable the data-caching process.

  • Power Limit Control - Highest (225)

The highest power limit lets the CPU boost to its maximum frequency, giving an overclocking-like performance benefit.

(Read CPU/BIOS-General debugging parameters for additional information)
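These BIOS options live in vendor-specific menus, but you can sanity-check what the OS actually sees once the system is up. A minimal sketch, assuming mlx5_0 is your adapter name and that the cpupower utility is installed:

      cat /sys/class/infiniband/mlx5_0/device/numa_node   # NUMA node the IB device is attached to
      lscpu | grep -i numa                                 # NUMA layout visible to the OS
      cpupower frequency-info                              # active governor / P-state behaviour
      cpupower idle-info                                   # C-states currently exposed to the OS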

AMD ROME

AMD Rome refers to the second generation of AMD’s EPYC server processors. These processors are built on the Zen 2 architecture and use a 7nm process technology, which enhances performance and energy efficiency compared to previous generations.

● AMD ROME - NUMA per Socket (NPS)

In the context of AMD Rome, “per socket” means the architecture is considered for each physical CPU socket on the motherboard. Each AMD Rome processor sits in a socket, and the NUMA configuration is applied individually to each of these sockets.

● NPS Configuration: 1, 2, or 4

NPS=1 means the whole AMD Rome CPU is a single NUMA domain (all the cores and all memory channels). Memory is interleaved across the eight memory channels. In this case, more bandwidth can be pushed to the HDR adapter.

NPS=4, on the other hand, divides the socket into four memory regions of two channels each. In this case the memory is more local to the CPU cores and memory latency is lower.

● AMD ROME-NPS Recommended Configuration

  1. For bandwidth benchmark testing (HDR 200Gbps), NPS=1 or 2 is the optimal setting, so that the memory bandwidth can be fully utilized.
  2. For 100G network testing, NPS=1/2/4 can meet the requirements.

(Figures: recommended configurations for NPS=2 and NPS=4. Image source: NVIDIA)
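NPS itself is a BIOS setting, but you can confirm what was applied by counting the NUMA nodes the OS reports: it should equal NPS multiplied by the number of populated sockets. A quick check (the node counts in the comments are only illustrative):

      numactl --hardware | grep available      # e.g. "available: 2 nodes" on a single-socket NPS=2 system
      lscpu | grep -i "numa node(s)"           # same information from lscpu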

Memory bandwidth

The bandwidth test places a high demand on memory bandwidth. It is recommended to use the fastest supported DDR DIMMs and to populate all memory channels.

Theoretical bandwidth of a single DDR4 channel: 64 bit * 2667 MT/s = 170.7 Gbps

64 bit * 3200 MT/s = 204.8 Gbps

With all eight channels of a Rome socket populated, this scales to 8 * 204.8 Gbps = 1638.4 Gbps (about 205 GB/s) per socket.

Check the memory speed: dmidecode -t 17 | grep Speed | grep C
Check the memory capacity per NUMA node: numactl -H
Memory bandwidth test tool: STREAM
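If you want to measure rather than calculate, STREAM is easy to run. A minimal sketch, assuming gcc with OpenMP and a stream.c downloaded from the STREAM homepage; the array size and thread count are only examples:

      # Build with OpenMP and an array large enough to defeat the CPU caches
      gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
      # Pin threads and memory to NUMA node 0 so the result reflects local bandwidth
      OMP_NUM_THREADS=16 numactl --cpunodebind=0 --membind=0 ./stream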

PCIE

Important setup options

  1. MAX READ REQUEST SIZE - Set to the maximum capability of the device. This reduces the splitting of read operations into multiple packets.
  2. MAX PAYLOAD SIZE - Set to the maximum capability of the RC/Switch/Device to reduce the splitting of PCIe TLPs on send and receive.
  3. Relaxed Ordering - For AMD CPUs, enable the device configuration that marks all DMA writes with the RO bit in the PCIe header. This allows higher bandwidth on AMD CPUs.
  4. PCIe Switch/Retimer - Minimize cascading as much as possible. Placing the GPU and NIC under the same level of PCIe switch is the optimal configuration; each PCIe switch/retimer cascade adds roughly 100-150 ns of latency. (The current PCIe settings can be checked as shown after this list.)
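A quick way to inspect the current values from Linux; the B:D.F address below is a placeholder for your own adapter, and the exact name of the relaxed-ordering firmware parameter can differ between ConnectX firmware releases:

      # Check the negotiated MaxPayload / MaxReadReq of the NIC
      lspci -s 0000:81:00.0 -vvv | grep -E "MaxPayload|MaxReadReq"
      # On ConnectX adapters the DMA-write relaxed-ordering behaviour is a firmware setting
      mlxconfig -d 0000:81:00.0 query | grep -i ordering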

IRQ/NUMA affinity

  • Make sure the test application runs on a CPU core in the same NUMA partition as the IB card, and run multiple task flows if necessary
  • Make sure the IRQ processing for the IB card is bound to a CPU core in the same NUMA partition; in the following example it is CPU core 4 (how to find the card's local cores is sketched below)

       e.g: set_irq_affinity_cpulist.sh 4 ib0

       taskset -c 4 application/benchmark

      numactl --physcpubind=4 --membind=0 application/benchmark
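To find out which NUMA node and CPU cores are local to the HCA in the first place, the kernel exposes this directly in sysfs. A sketch, with mlx5_2 as a placeholder device name:

      cat /sys/class/infiniband/mlx5_2/device/numa_node       # NUMA node of the HCA
      cat /sys/class/infiniband/mlx5_2/device/local_cpulist   # CPU cores local to the HCA
      numactl --hardware                                      # NUMA node distances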

(Read “NVIDIA Networking – InfiniBand Solutions” for additional information)

Benchmark tools

RDMA Latency performance test

  • Make sure to use the core local to the HCA, in this example core 80
  • In this example, I am also using 10,000 iterations to make sure the output is smooth
  • Expected latency is around 1 usec (0.97-1.02) for a Rome 7742 at 2.25 GHz using an HDR adapter over a switch
  • NPS configuration is not critical here
  • Command: numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80 ib_write_lat -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000

RDMA One-way bandwidth test-NPS1

  • Similarly, make sure to use the core local to the HCA, in this example core 80
  • In this example, I am also using 10,000 iterations to make sure the output is smooth
  • Expected bandwidth is HDR line rate at around 8-16K message size. Use multiple QPs if needed (a multi-QP sketch follows below).
  • Command: numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 & ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -n 10000
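When a single queue pair cannot reach line rate, perftest can spread the traffic across several QPs with the -q option. A sketch that follows the host names and device of the example above, using 4 QPs:

      # Server side (rome001)
      numactl --physcpubind=80 ib_write_bw -d mlx5_2 -i 1 -q 4 --report_gbits -F -n 10000
      # Client side (rome002), pointing at the server
      numactl --physcpubind=80 ib_write_bw -d mlx5_2 -i 1 -q 4 --report_gbits -F -n 10000 rome001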

RDMA Bidirectional bandwidth test-NPS1

  • Similarly, make sure to use the core local to the HCA, in this example core 80
  • In this example, I am also using 10,000 iterations to make sure the output is smooth
  • Expected average BW reaches line rate at around 32K message size
  • Command: numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b & ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -b -n 10000

RDMA Bidirectional bandwidth test-NPS4

  • Similarly, make sure to use the core local to the HCA, in this example core 80
  • In this example, I am also using 10,000 iterations to make sure the output is smooth
  • Expected peak BW reaches line rate, but the average BW is lower
  • Command: numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F -n 10000 -b & ssh rome002 numactl --physcpubind=80 ib_write_bw -a -d mlx5_2 -i 1 --report_gbits -F rome001 -b -n 10000


GPU DIRECT Performance Testing

  • Make sure to use a GPU and IB card with link affinity; both need to sit under the same CPU root complex (RC) for good performance
  • The best configuration is the GPU and IB card under the same level of PCIe switch, which fully exploits the PCIe switch's P2P capability and minimizes latency
  • NCCL support is required, and the recommended NCCL version is not less than 2.8
  • Specify the GPU and IB devices for the test according to their numbers in the topology (see the topology check below)
  • Command: numactl --physcpubind=80 ib_write_bw -d mlx5_0 --use_cuda=0 -a --report_gbits
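To see which GPU shares a PCIe switch or root complex with which NIC, the NVIDIA driver can print the topology matrix; ibdev2netdev (shipped with MLNX_OFED) helps map the mlx5_N device names onto it. The exact matrix naturally depends on your server:

      nvidia-smi topo -m    # connection matrix between GPUs, NICs and their CPU/NUMA affinity
      ibdev2netdev          # map IB device names (mlx5_N) to network interfaces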

OSU MPI Latency Test

  • Make sure to use the latest HPC-X
  • Use the core local to the adapter
  • In this example, I am also running a large number of iterations (and warm-up iterations) to make sure the output is smooth
  • Expected latency is around 1.1 usec (1.05-1.15) for a Rome 7742 at 2.25 GHz using an HDR adapter over a switch
  • OSU 5.6.2
  • Use UCX, Disable HCOLL
  • Command: mpirun -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca coll_hcoll_enable 0 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1 osu_latency -i 100000 -x 100000

OSU MPI Bandwidth test

  • Make sure to use the latest HPC-X
  • Use the core local to the adapter
  • In this example, I am also running a large number of iterations (and warm-up iterations) to make sure the output is smooth
  • Expected bandwidth is line rate for large message sizes for a Rome 7742 at 2.25 GHz using an HDR adapter over a switch
  • OSU 5.6.2
  • Set NPS=1 on the BIOS
  • Use UCX, Disable HCOLL
  • Command: mpirun -np 2 -map-by ppr:1:node -rank-by core -bind-to cpu-list:ordered -cpu-list 80 -mca coll_hcoll_enable 0 -mca pml ucx -x UCX_NET_DEVICES=mlx5_2:1

       osu_bw -i 100000 -x 100000
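If HPC-X is not already in your environment, it ships as a self-contained tarball with an init script. A sketch; the version string and paths below are placeholders, and the exact environment variable names can vary between HPC-X releases:

      tar -xjf hpcx-v2.x-gcc-mlnx_ofed-redhat8-x86_64.tbz
      cd hpcx-v2.x-gcc-mlnx_ofed-redhat8-x86_64
      source hpcx-init.sh && hpcx_load      # puts mpirun, UCX and the bundled benchmarks in the environment
      ls $HPCX_OSU_DIR                      # location of the pre-built OSU micro-benchmarks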

Performance diagnostics

Physical link bit error rate (BER)

  • PCIe link monitoring - BER target ≤ 1E-12 for Gen3

      lspci -s B:D.F -vvv      # check link width, speed and error status

      mlxlink -d mlx5_0 --port_type PCIE -e

  • NIC/IB port monitoring - see the BER targets below

      mlxlink -d mlx5_0 -e

    BER thresholds default values

BER type       | No-FEC (EDR)   | RS-FEC         | LL RS-FEC      | KP4 RS-FEC
               | Warn    Alarm  | Warn    Alarm  | Warn    Alarm  | Warn    Alarm
Raw BER        | 1e-13   5e-12  | 1e-6    1e-5   | 1e-5    5e-5   | 5e-5    1e-4
Effective BER  | 1e-13   5e-12  | 1e-13   5e-12  | 1e-12   1e-11  | 1e-13   1e-11
Symbol BER     | 1e-13   5e-12  | 1e-13   5e-12  | 1e-13   1e-12  | 1e-13   5e-12

SWITCH port monitoring-COUNTERS

  • Monitor the counters of a specific port

       e.g: perfquery <lid> <port> -r

  • A rapidly increasing error counter indicates that the link condition of the port is poor
  • A rapidly increasing PortXmitWait counter indicates congestion, which may be the cause of bandwidth problems (a simple polling sketch follows below)
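A minimal way to watch how fast the counters grow; the LID (45) and port number (17) are placeholders for the switch port you care about:

      # Re-read the counters every second; steadily increasing error counters point at link
      # quality problems, while a growing PortXmitWait points at congestion
      watch -n 1 "perfquery 45 17"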

(Figure: example of switch port counters. Image source: NVIDIA)

Performance diagnosis on the HOST side

  • Monitor CPU usage. When the CPU core running the task has reached 100% usage and the bandwidth is still insufficient, add another task process on other local CPU cores

       e.g: top -d 1

       e.g: watch -n 1 numactl --hardware      # watch per-NUMA-node memory utilization

  • If the local CPU cores are fully loaded, select the cores with the next-best NUMA distance for the new task processes
  • Network port bandwidth monitoring

       e.g: mlnx_perf -i ib0 | grep tx_bytes_phy

  • Network card PCIe bandwidth monitoring

       e.g: python /opt/neohost/sdk/get_device_performance_counters.py --mode=shell --dev-uid=0000:B:D.F --get-analysis --run-loop

By following the tips we have outlined for debugging and optimizing your InfiniBand networking, you can significantly improve the efficiency and reliability of your high-speed NVIDIA network. Regular firmware updates, comprehensive network traffic monitoring, and tailored configuration adjustments are key strategies for ensuring optimal performance. Additionally, adopting a systematic approach to troubleshooting helps you quickly identify and resolve issues before they impact your network. Paying special attention to components such as NVIDIA InfiniBand switches will further enhance the robustness and overall performance of your networking setup, ensuring it meets your specific requirements and operational demands.

For those looking to invest in high-quality networking equipment, Router-switch.com is a trusted resource. Renowned for 22 years of offering top-notch products, such as NVIDIA InfiniBand switches, at competitive prices, we provide reliable and cost-effective solutions for your networking needs.


