We recently got a new small compute cluster at the university, running Rocks Clusters Linux 6.1.1, a CentOS 6 derivative. The nodes are interconnected via an InfiniBand network. Unfortunately, the default configuration of OpenMPI 1.6.2 in the HPC roll wastes a significant amount of performance: it communicates using TCP, which is run over a load-balanced combination of IP over InfiniBand and IP over Ethernet.
Switching to RDMA over InfiniBand is simple: just run the following command on all compute nodes and the head node:
sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh
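To verify that the switch took effect, a quick check like the following (my addition, not part of the Rocks setup) should show the rocks-openmpi_ib module loaded and the openib BTL available after you log in again:

# the IB-enabled module should now be loaded ...
module list 2>&1 | grep openmpi
# ... and the openib BTL should show up among the available components
ompi_info | grep openib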
Now, however, you get a message like this when you run an MPI job:
--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory. This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered. You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel
module parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              bee.icp.uni-stuttgart.de
  Registerable memory:     32768 MiB
  Total memory:            130967 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------
To fix that, run
echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf
on all nodes and reboot. log_mtts_per_seg defaulted to 3 on our kernel and did not need tweaking. To check your current values, run
grep . /sys/module/mlx4_core/parameters/*mtt*
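According to the Open MPI FAQ linked in the warning, the registerable memory is roughly 2^log_num_mtt * 2^log_mtts_per_seg * page size; with log_num_mtt=24, log_mtts_per_seg=3 and 4 KiB pages that comes to 512 GiB, comfortably more than the roughly 128 GiB of RAM in our nodes. If you want to recompute the limit from the live parameters, a small sketch like this (mine, assuming 4 KiB pages and explicitly set values rather than auto-computed defaults) does the arithmetic:

# registerable memory = 2^log_num_mtt * 2^log_mtts_per_seg * page size
mtt=$(cat /sys/module/mlx4_core/parameters/log_num_mtt)
seg=$(cat /sys/module/mlx4_core/parameters/log_mtts_per_seg)
echo "$(( (1 << (mtt + seg)) * 4096 / 1024 / 1024 )) MiB registerable"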
One warning message that still comes up when running an MPI job is the following:
--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
bee.icp.uni-stuttgart.de:30104: open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
bee.icp.uni-stuttgart.de:30104: open_hca: device mthca0 not found
bee.icp.uni-stuttgart.de:30104: open_hca: device mthca0 not found
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
As uDAPL support was removed in newer OpenMPI versions anyway, this is fixed by running
echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf
on all compute nodes and the head node.
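If you prefer not to touch the parameter file, the same thing can be done per job on the mpirun command line; an illustrative invocation (program name and process count are placeholders) would be

# disable the uDAPL BTL for this job only
mpirun --mca btl "^udapl" -np 16 ./my_mpi_program

The entry in openmpi-mca-params.conf simply makes this exclusion the default for every job on the cluster.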
So all in all, you can simply add the following lines to the <post> section of /export/rocks/install/site-profiles/6.1.1/nodes/extend-compute.xml and rebuild your compute node image:
echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf dracut -f 2.6.32-504.16.2.el6.x86_64 # may need to rebuild the initrd so it picks up the modprobe parameters