Monthly Archives: November 2020

Ubuntu 20.04: OpenMPI bind-to NUMA is broken when running without mpiexec

I tend to set the CPU pinning for my OpenMPI programs to the NUMA node. That way, they always access fast local memory without having to cross between processors. Some recent CPUs like the AMD Ryzen Threadripper have multiple NUMA nodes per socket, so pinning to the socket is not the same thing.
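
For reference, this is the kind of pinning I mean. With mpiexec you can request it via --bind-to; for standalone runs, the same policy can be set through the hwloc_base_binding_policy MCA parameter in the environment, which is how the singleton code path discussed below gets exercised (the program name here is just a stand-in):

$ mpiexec --bind-to numa -n 4 ./my_mpi_program
$ OMPI_MCA_hwloc_base_binding_policy=numa ./my_mpi_program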

After upgrading to Ubuntu 20.04, we started seeing error messages like this:

$ python3 -m mpi4py.bench helloworld
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

 Setting processor affinity failed failed
 --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------

Launching through mpiexec/mpirun, even with just one MPI rank, did not trigger the error:

$ mpiexec -n 1 python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 1 on host1.
$ mpirun -n 1 python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 1 on host1.
$ mpiexec -n 4 python3 -m mpi4py.bench helloworld
Hello, World! I am process 3 of 4 on host1.
Hello, World! I am process 0 of 4 on host1.
Hello, World! I am process 1 of 4 on host1.
Hello, World! I am process 2 of 4 on host1.

If you look through the OpenMPI code, you can see that CPU pinning is done by different code paths depending on whether you run standalone (called singleton mode) or through mpiexec. The relevant bit for the former is in ess_base_fns.c. It searches for a hwloc object of type HWLOC_OBJ_NODE (which is deprecated on the hwloc side and identical to the newer HWLOC_OBJ_NUMANODE). Since hwloc 2.0, however, NUMA nodes are no longer containers for CPU cores but sit alongside them inside a HWLOC_OBJ_GROUP, so the search comes up empty. Compare the topology as reported by hwloc 1.x and by hwloc 2.x:

$ lstopo --version
lstopo 1.11.9
$ lstopo --output-format console
Machine (31GB total) + Package L#0
  NUMANode L#0 (P#0 16GB)
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
[…]
$ lstopo --version
lstopo 2.1.0
$ lstopo --output-format console
Machine (31GB total) + Package L#0
  Group0 L#0
    NUMANode L#0 (P#0 16GB)
    L3 L#0 (8192KB)
      L2 L#0 (512KB) + L1d L#0 (32KB) + L1i L#0 (64KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#12)
[…]
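
As a sanity check, ldd confirms which hwloc library the OpenMPI runtime on Ubuntu 20.04 is linked against, and thus which of the two topologies above it sees:

$ ldd /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-rte.so.40.20.3 | grep hwloc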

The current OpenMPI master (i.e. versions beyond the 4.1.x series) doesn't bind through hwloc anymore, so the issue is fixed upstream (if only by accident). However, we're stuck with Ubuntu 20.04 for the next two years, so let's fix it ourselves. We load the offending file, /usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-rte.so.40.20.3, into Hopper and jump to orte_ess_base_proc_binding. Comparing the disassembly to the C code quickly reveals the instruction we need to change.

0x3 is OPAL_BIND_TO_NUMA and 0xd is HWLOC_OBJ_NODE; in the bytes below, BA 0D 00 00 00 is the mov edx, 0xd that loads this object type. Replacing 0xd with 0xc (HWLOC_OBJ_GROUP) makes the lookup find the object that actually contains the CPU cores under hwloc 2.x. Looking at the hex code tells us that we need to make this change:

- 66 83 F8 03 0F 85 70 02 00 00 BA 0D 00 00 00
+ 66 83 F8 03 0F 85 70 02 00 00 BA 0C 00 00 00

Here’s a bit of Python code to do that:

import mmap
# Replace HWLOC_OBJ_NODE (0x0d) with HWLOC_OBJ_GROUP (0x0c) in the instruction found above:
old = bytes.fromhex("66 83 F8 03 0F 85 70 02 00 00 BA 0D 00 00 00")
new = bytes.fromhex("66 83 F8 03 0F 85 70 02 00 00 BA 0C 00 00 00")
with open("/usr/lib/x86_64-linux-gnu/openmpi/lib/libopen-rte.so.40.20.3", 'r+b') as f:
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE)
    pos = m.find(old)
    assert pos != -1, "byte pattern not found"
    m.seek(pos)
    m.write(new)
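
After applying the patch (the file lives under /usr/lib, so the script needs to run as root), the singleton case from the beginning should work again:

$ python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 1 on host1.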

What to do when Mathematica’s ParallelMap/ParallelTable takes a long time to start up

I have a Mathematica notebook that derives some rather massive expressions. I wanted to apply some transformations to them in parallel using ParallelMap or ParallelTable, but noticed that these commands ran on only a single CPU core for hours before actually starting to run in parallel and occupying all CPU cores. During that single-core phase, I could not even abort the evaluation with Alt-. as one usually can; it simply appeared stuck.

makeMassiveExpression[x_] := ...;
process[x_] := Simplify[x];
a1 = simpleExpression;
a2 = makeMassiveExpression[a1];
a3 = makeMassiveExpression[a2];
as = {a1, a2, a3};

b = ParallelTable[process[as[[i]]], {i, Length[as]}];

As it turns out, Mathematica copies all definitions from the main kernel to the parallel kernels during the startup phase, and that copying appears to be a rather inefficient procedure. So let's transfer the needed definitions manually:

makeMassiveExpression[x_] := ...;
process[x_] := Simplify[x];
a1 = simpleExpression;
a2 = makeMassiveExpression[a1];
a3 = makeMassiveExpression[a2];
as = {a1, a2, a3};

DistributeDefinitions[as, process];
b = ParallelTable[process[as[[i]]], {i, Length[as]}, DistributedContexts -> None];

Now DistributeDefinitions is the slow part, but ParallelTable immediately starts running in parallel on all kernels. We haven't gained anything by splitting things up like this, but at least we can now tell exactly where the problem lies: transferring the massive expressions. So instead of sending those to the parallel kernels, let's transfer only the simple expression and have the parallel kernels derive the massive expressions themselves:

makeMassiveExpression[x_] := ...;
process[x_] := Simplify[x];
a1 = simpleExpression;

DistributeDefinitions[a1, makeMassiveExpression, process];

ParallelEvaluate[(
   a2 = makeMassiveExpression[a1];
   a3 = makeMassiveExpression[a2];
   as = {a1, a2, a3}
), DistributedContexts -> None];

(* as now exists only on the parallel kernels, so the main kernel cannot
   determine the iterator bound via Length[as]; give it explicitly *)
b = ParallelTable[process[as[[i]]], {i, 3}, DistributedContexts -> None];

Letter to the editor: “Corona restrictions”

In November 2020, the German state governments decided to reinstate a large part of the measures they had already taken against the spread of the coronavirus in the spring. On November 4, 2020, the Süddeutsche Zeitung printed a letter to the editor I had written on the subject:

Too much optimism

The second quasi-lockdown shows that the first one brought no lasting benefit, but merely delayed the inevitable by a few months. Nor will the third or fourth bring us close enough to a cure, however much we may wish for it. At the same time, the realization is increasingly taking hold that the hopes placed in vaccine development were far too optimistic and that a vaccine will likely not make the containment measures obsolete. So one really must ask whether the goal we are trying to reach is attainable at all. This question is also fundamental to the legal assessment: if a measure is unsuited to achieving its goal, it is disproportionate. In any case, there is no shame in failing against an overpowering opponent such as a natural disaster. On the contrary, it shows that we are still humans and not gods. Unfortunately, politicians are not known for being able to admit their own mistakes. Yet this will be necessary, because with the current strategy there will probably be no “after Corona”, if mere wishful thinking can be called a strategy at all.

Michael Kuron, Frickenhausen