Tag Archives: cluster

Fixing OpenMPI over InfiniBand on Rocks Cluster Linux

We recently got a new small compute cluster at the university, running Rocks Clusters Linux 6.1.1, a CentOS 6 derivative. The nodes are interconnected via an InfiniBand network. Unfortunately, the default configuration of OpenMPI 1.6.2 in the HPC roll wastes a significant amount of performance: it communicates using TCP, which is run over a load-balanced combination of IP over InfiniBand and IP over Ethernet.

Switching to DMA over InfiniBand is simple: just run the following command on all compute nodes and the head node:

sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh

Now however, you get a message like this when you run an MPI job:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              bee.icp.uni-stuttgart.de
  Registerable memory:     32768 MiB
  Total memory:            130967 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

To fix that, run

echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf

on all nodes and reboot. log_mtts_per_seg defaulted to 3 on our kernel and did not need tweaking. To check your current values, run

grep . /sys/module/mlx4_core/parameters/*mtt*

One warning message that still comes up when running an MPI job is the following:

--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:]. 
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
bee.icp.uni-stuttgart.de:30104:  open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
bee.icp.uni-stuttgart.de:30104:  open_hca: device mthca0 not found
bee.icp.uni-stuttgart.de:30104:  open_hca: device mthca0 not found
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory

As UDAPL is removed in newer OpenMPI versions anyway, this is fixed by running

echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf

on all compute nodes and the head node.

So all in all, you can simply add the following lines to /export/rocks/install/site-profiles/6.1.1/nodes/extend-compute.xml and rebuild your compute node image:

echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf
sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh
echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf
dracut -f 2.6.32-504.16.2.el6.x86_64 # may need to rebuild the initrd so it picks up the modprobe parameters