Category Archives: Linux

Root disk spindown on Debian 9

I recently installed Debian 9 on a Seagate PersonalCloud. Because the device will only get backed up to once every day, I want its disk to be spun down when it’s not needed. Even on a minimal install, you’ll find quite a few background services that access the root disk every few minutes. Here is what I had to do to keep my disk spun down.

hdparm

First, install hdparm (apt-get install hdparm) and configure /etc/hdparm.conf to spin down your disks after 10 minutes:

#quiet
spindown_time = 120
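
To check that the setting is active and to see whether the drive has actually spun down, you can also talk to the disk directly with hdparm (a value of 120 corresponds to 120 × 5 s = 10 minutes):

hdparm -S 120 /dev/sda    # apply the same spindown timeout immediately, without rebooting
hdparm -C /dev/sda        # report whether the drive is active/idle or in standby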

Since hdparm isn’t available in the initrd when the udev rule /lib/udev/rules.d/85-hdparm.rules fires, the spindown setting won’t get applied at boot. To apply it once the system is up, add /etc/systemd/system/hdparm-sda.service:

[Unit]
Description=hdparm sda
ConditionPathExists=/lib/udev/hdparm

[Service]
Type=forking
Environment=DEVNAME=/dev/sda
ExecStart=/lib/udev/hdparm
TimeoutSec=0
StandardOutput=tty
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Now systemctl daemon-reload && systemctl enable hdparm-sda && systemctl start hdparm-sda.

cron.hourly

When you look into /var/log/syslog, you see messages like

CRON[393]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)

Of course, we can’t have these messages written to a log if we want the disk to remain spun down, so edit /etc/crontab and comment out the hourly job. As long as /etc/cron.hourly is empty, this will not do any harm. If you have any hourly jobs, you might want to move them to daily jobs.
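
On a stock Debian install, the /etc/crontab entry to comment out looks roughly like this:

#17 *    * * *   root    cd / && run-parts --report /etc/cron.hourly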

systemd timers

Systemd has its own cron-like timer mechanism. You can view active timers with systemctl list-timers --all and disable the ones you don’t need, especially those that run more than once a day, for example: systemctl disable snapper-timeline.timer && systemctl stop snapper-timeline.timer.

smartd

If you have smartmontools installed (apt-get install smartmontools), you’ll see lines like the following appear in the syslog when the disk is spun down:

smartd[294]: Device: /dev/sda [SAT], is in STANDBY mode, suspending checks

Writing these messages causes the disk to spin up, so we need to disable smartd: systemctl stop smartd && systemctl disable smartd. To keep monitoring our disks, put the following into /etc/cron.daily/smart-check and then chmod +x /etc/cron.daily/smart-check:

#!/bin/bash

/usr/sbin/smartctl -q errorsonly -A /dev/sda

systemd-tmpfiles-clean

When you run

inotifywait -m -r -e access -e modify -e create -e delete --format 'PATH:%w%f EVENTS:%,e' --exclude '/(dev/pts|proc|sys|run)' /

to see what is going on on your disk, you’ll notice that your temporary directories are being cleaned every couple of minutes. We can reduce that to once every two weeks by running systemctl edit systemd-tmpfiles-clean.timer and pasting

[Timer]
OnBootSec=5min
OnUnitActiveSec=14d

postfix

If you have Postfix installed, inotify will also show you that it periodically checks its queue directories. So uninstall postfix (apt-get remove postfix) and configure a forwarding MTA that does not run as a daemon.
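
I won’t prescribe which MTA to pick; one option would be msmtp-mta, which only runs when mail is actually being sent. A minimal /etc/msmtprc for that (smarthost and sender address are placeholders) might look like this:

# msmtp-mta provides a sendmail replacement that only runs on demand
account default
host mail.example.com
port 25
from personalcloud@example.com
aliases /etc/aliases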

systemd-timesync

This last one was tricky to discover because it doesn’t appear in the logs and inotify doesn’t see it. You can enable logging of every disk access to the kernel log to see even more, but you need to disable syslog first, otherwise you’ll get a self-amplifying write cascade:

systemctl stop syslog.socket
systemctl stop rsyslog.service
dmesg -C
echo 1 | sudo tee /proc/sys/vm/block_dump
dmesg -Tw

Here you’ll see that systemd-timesyncd stores its last sync date by touching a file. It only changes metadata, which is why inotify doesn’t see it happening. My solution was to put the following into /etc/tmpfiles.d/zz-clock.conf:

d /run/systemd/timesync 0755 systemd-timesync systemd-timesync - -
f /run/systemd/clock 0644 systemd-timesync systemd-timesync - -
f /run/systemd/timesync/clock 0644 systemd-timesync systemd-timesync - -
L+ /var/lib/systemd/timesync/clock - - - - /run/systemd/timesync/clock
L+ /var/lib/systemd/clock - - - - /run/systemd/clock

/var/lib/systemd/clock is used by systemd 234 and lower, while /var/lib/systemd/timesync/clock is used by systemd 235 and higher. So the latter will only be needed once you upgrade to Debian 10.
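
The symlinks are created at the next boot; to apply the new tmpfiles configuration right away, you can run:

systemd-tmpfiles --create /etc/tmpfiles.d/zz-clock.conf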

AMD Ryzen Threadripper: NUMA architecture, CPU affinity and HTCondor

At the university lab I work at, we usually get desktop computers with Intel Core i7 Extreme CPUs. They have more memory controllers and more cores than Intel’s mainstream CPUs, which is great for us because we do software development and simulations on these machines. Now that it was time to order some new machines, we decided to check out what AMD offers. The last time AMD’s Athlon series was competitive with Intel in the high-end desktop market was probably in the days before the Core i series was introduced a decade ago. Even in the server/HPC market, AMD’s Opteron series hadn’t been a serious Intel competitor after 2012. Now with the Ryzen, AMD finally has something that beats Intel’s high-end offerings both in absolute price and in price per performance.

Today’s high-end CPU market

As prices on Intel’s high-end chips seem to have been increasing with every generation and after Intel didn’t handle the Spectre/Meltdown disaster very well at all earlier this year, we decided that it really was time to break Intel dominance in our lab. I would actually have liked to get something with a non-x86 architecture (because why not), but the requirement of eight CPU cores and four DDR4 memory channels ruled out pretty much everything. The remaining ones were either eliminated based on price (the IBM POWER9 CPU as in the Talos II) or because they are not available in the market (ARM in the form of the Cavium ThunderX2 or Qualcomm Centriq). The POWER9 and ThunderX2 actually have eight memory channels (the Talos’s mainboard only provides access to four though), as does the AMD Epyc (the server version of the Threadripper), so they or their successors might still become interesting options in the future. For comparison, Intel’s current Xeon Scalable series only has six channels, as does the Centriq.

The AMD Ryzen family

We decided to order a Threadripper 1950X with 16 cores. When I unpacked it and started my first benchmark simulation, I was pretty disappointed by the performance though. It turned out that the Threadripper is a NUMA architecture, but you need to toggle a BIOS option (set Memory Interleave to Channel) before it actually presents itself as such to the operating system. In its default mode, memory latencies are very high because a process might be running on a CPU core that is using memory on the other pair of memory controllers. The topology it presents in NUMA mode is roughly like this: out of the 16 cores, four cores each share a common L3 cache to make what AMD calls a core complex (CCX). Two CCXes together share a two-channel memory controller and sit on the same die. Two dies are interconnected with a 50 GB/s link. In NUMA mode and after setting the correct CPU affinity on my simulation processes (which is what the remaining sections of this blog post will be about), I was getting the expected performance — and as you might have expected, the Threadripper really is powerful.

You can get something comparable from Intel, the Core i9 Extreme, but these chips are extreme not only in name but also in price. Also, I love the simplicity of AMD’s product lineup: the Ryzen series has two memory channels and 4-8 cores (i.e. one CCX), the Ryzen Threadripper has four memory channels and 8-16 cores (two CCXes) and the Epyc has eight memory channels and up to 32 cores.

Don’t bother with the second-generation Threadrippers that were just announced: the 2920X is almost identical to the 1920X and the 2950X to the 1950X. The 2970WX and 2990WX are really odd chips: they have four dies (like the Epyc), but two of them don’t have their own memory controllers. So half of these chips would be as fast as my 1950X in NUMA mode, and the other half would be slower than my 1950X in its default mode.

CPU affinity for MPI

Usually, you only have to worry about CPU pinning and process affinity on servers, high-performance compute clusters and some workstations. Since AMD introduced the Opteron and Intel introduced the Core architecture, machines with multiple processor sockets have been associating memory controllers with CPUs in such a way that memory access within a socket was fast and between sockets was slower. This is called NUMA. To minimize inter-socket memory accesses, you need to tell your operating system’s scheduler to never move threads or processes between cores. This is called pinning or affinity. On servers, it is usually taken care of by the application, while on HPC clusters the admin configures the MPI library to do the right thing. The Threadripper is probably the first chip that brings NUMA to the desktop market. 

If you are using OpenMPI, you just need to set two environment variables

export OMPI_MCA_hwloc_base_binding_policy=numa
export OMPI_MCA_rmaps_base_mapping_policy=numa

so that processes are pinned (bound) to a NUMA domain (usually that’s a socket, but in the Threadripper it’s a die with two CCXes). The second variable tells OpenMPI to create (map) processes on alternating NUMA domains — so the first one is put on the first die, the second one on the second die, the third one on the first die, and so on.

You can also use the -bind-to numa and -map-by numa command-line arguments to mpiexec/mpirun to achieve the same. Using l3cache instead of numa may actually get you even better performance because latencies within a CCX are lower than between CCXes within a die, but that only gains you a few percent and requires running more processes with fewer threads each, which may be suboptimal for some software. Below is an interesting figure about latency between cores, as measured with this tool (I’m not sure about the yellow squares though — I would have expected them to be more turquoise, and they don’t seem to match what I observe in actual performance).
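
To verify where the processes actually end up, you can ask OpenMPI to print its bindings; something like the following (the binary name is just a placeholder) should show each rank bound to one die with its eight cores:

mpirun -np 2 -map-by numa -bind-to numa --report-bindings ./my_simulation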

If you are using Intel MPI (some commercial software we use does that), you only need

export I_MPI_PIN_DOMAIN=numa

to get pinning. The process creation already happens on alternating domains by default.

I like to put these variables in a file in /etc/profile.d so that they are automatically set for everyone who logs into these machines. I didn’t know this before, but both the GNU and the Intel OpenMP library will only create one thread for each core from their affinity mask, so you don’t even need to set OMP_NUM_THREADS=8 manually if you are running hybrid-parallelized codes.
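
For example, a file like /etc/profile.d/mpi-affinity.sh (the name is arbitrary) containing all three variables does the job:

# pin MPI processes to NUMA domains by default
export OMPI_MCA_hwloc_base_binding_policy=numa
export OMPI_MCA_rmaps_base_mapping_policy=numa
export I_MPI_PIN_DOMAIN=numa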

CPU affinity with HTCondor

Our machines run 24 hours a day, 365 days a year. When nobody is using them locally or via SSH, they run simulation jobs via the HTCondor job scheduler. Of course, we want to make good use of the resources with these jobs too, so we need CPU pinning as well. Usually, we set

NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%
SLOT_TYPE_1_PARTITIONABLE = true

in our Condor configuration so that people can decide themselves how many CPUs they want for their jobs. For the Threadripper, it doesn’t make sense to use more than 8 cores per job because we don’t want jobs to cross NUMA domains. This means we need two slots with 50% of the CPU cores, and we want to set SLOT<n>_CPU_AFFINITY to pin the processes to the die. I wrote a Jinja2 template that creates this Condor configuration:

{%- set nodes = salt.file.find('/sys/devices/system/node', name='node[0-9]*', type='d') -%}
NUM_SLOTS = {{ nodes | count }}
ENFORCE_CPU_AFFINITY = True

{% for node in nodes -%}
NUM_SLOTS_TYPE_{{ loop.index }} = 1
{%- set cpus = salt.file.find(node, name='cpu[0-9]*', type='d') -%}
{%- set physical_cpus = [] -%}
{% for cpu in cpus %}
{%- set cpu_id = cpu.replace(node + '/cpu', '') -%}
{%- set physical_id = salt.file.read(cpu + '/topology/core_id') -%}
{% if cpu_id.strip() == physical_id.strip() %}
{%- do physical_cpus.append(cpu) -%}
{% endif -%}
{% endfor %}
SLOT_TYPE_{{ loop.index }} = cpus={{ physical_cpus | count }}
SLOT_TYPE_{{ loop.index }}_PARTITIONABLE = true
{%- set cpu_ids = [] -%}
{% for cpu in cpus %}
{%- do cpu_ids.append(cpu.replace(node + '/cpu', '')) -%}
{% endfor %}
SLOT{{ loop.index }}_CPU_AFFINITY = {{ cpu_ids | join(',')}}

{% endfor -%}

If you don’t have a Jinja2-based configuration manager like SaltStack, you can use the following Python script to render the template:

import os, sys
from jinja2 import Environment, FileSystemLoader
import salt
import salt.modules.file

salt.file = salt.modules.file

file_loader = FileSystemLoader(os.path.dirname(__file__))
env = Environment(loader=file_loader, extensions=['jinja2.ext.do'])
template = env.get_template(sys.argv[1])
template.globals['salt'] = salt
print (template.render())
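
Assuming the script is saved as render.py next to a template called condor_slots.j2 (both names are my choice), you would render it into a Condor configuration snippet like this:

python render.py condor_slots.j2 > /etc/condor/config.d/99-slots.conf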

This produces

NUM_SLOTS = 2
ENFORCE_CPU_AFFINITY = True

NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=8
SLOT_TYPE_1_PARTITIONABLE = true
SLOT1_CPU_AFFINITY = 0,1,16,17,18,19,2,20,21,22,23,3,4,5,6,7

NUM_SLOTS_TYPE_2 = 1
SLOT_TYPE_2 = cpus=8
SLOT_TYPE_2_PARTITIONABLE = true
SLOT2_CPU_AFFINITY = 10,11,12,13,14,15,24,25,26,27,28,29,30,31,8,9

One additional complication is that some of our users like to set getenv = True in their condor submit file. This means that the variables we set in the global shell profile in the previous section are inherited by the Condor job, overriding the affinity we set for the slots. So we need a job wrapper script that removes these variables if they are present. Put

#!/bin/bash

for var in $(/usr/bin/env | /bin/grep -E '^(OMPI_|I_MPI_)' | /usr/bin/awk -F= '{print $1}'); do
    echo "Removing environment variable $var" >&2
    unset "$var"
done

exec "$@"
error=$?
echo "Failed to exec($error): $@" > $_CONDOR_WRAPPER_ERROR_FILE
exit 1

into /usr/lib/condor/libexec/condor_pinning_wrapper.sh and add

USER_JOB_WRAPPER = /usr/lib/condor/libexec/condor_pinning_wrapper.sh

to your Condor configuration.
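
After changing the configuration, the running daemons need to reload it; condor_reconfig is usually sufficient, although changes to the slot layout may require a restart of the condor service:

condor_reconfig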

Filtering outgoing traffic with VirtualBox’s NAT interface

Hypervisors like VMware Workstation, VMware Fusion or VirtualBox usually offer three kinds of network interfaces: bridged (to a network on the host), NAT (sharing an IP address with the host via network address translation) and host-only (a connection exclusively between host and guest).

VMware, at least on Linux, realizes NAT entirely in the kernel, using standard IP forwarding and setting up a DHCP server on the host that gives addresses to the guest. VirtualBox, on the other hand, handles NAT entirely in user space, meaning all packets entering and leaving the VM really seem to be going to and from a process named VBoxHeadless or similar.

If you want to limit what kinds of connections a guest system can make, VMware lets you do that quite easily by adding rules to the FORWARD chain:

sudo iptables -I FORWARD -i vmnet8 -j ACCEPT -d github.com
sudo iptables -I FORWARD -i vmnet8 -j REJECT

Unfortunately, things are a lot more difficult with VirtualBox. Since there is no dedicated interface, you need to filter based on process. iptables can’t do that directly, but you can use cgroups to add the necessary marks to the packets:

sudo mkdir /sys/fs/cgroup/net_cls/virtualbox
echo 86 | sudo tee /sys/fs/cgroup/net_cls/virtualbox/net_cls.classid
sudo iptables -N VIRTUALBOX
sudo iptables -I OUTPUT -j VIRTUALBOX -m cgroup --cgroup 86
sudo iptables -I VIRTUALBOX -j ACCEPT -d github.com
sudo iptables -I VIRTUALBOX -j REJECT

Of course, you now need to create the VirtualBox process inside that cgroup. One way is

sudo apt-get install cgroup-tools
sudo cgexec -g net_cls:virtualbox vboxmanage startvm WindowsXP --type=headless

or you can use a wrapper script instead of vboxmanage:

#!/bin/sh -e
CGROUP_NAME=virtualbox
CGROUP_ID=86

if [ ! -d /sys/fs/cgroup/net_cls/$CGROUP_NAME/ ]; then
  mkdir /sys/fs/cgroup/net_cls/$CGROUP_NAME
  echo $CGROUP_ID > /sys/fs/cgroup/net_cls/$CGROUP_NAME/net_cls.classid
fi

/bin/echo $$ > /sys/fs/cgroup/net_cls/$CGROUP_NAME/tasks

exec /usr/bin/vboxmanage "$@"

Using a BIND DNS server in an Active Directory Environment

Years ago, I posted a script that allowed ISC DHCPd to update a Microsoft DNS server with dynamic records for DHCP clients. I haven’t used that method in a long time, as there is a much simpler method: use ISC DHCPd together with the BIND DNS server like everybody else does, and only delegate the _msdcs and _sites zones from the BIND server to the Microsoft DNS servers:

_msdcs.example.com. 86400 IN NS dc01.example.com.
_msdcs.example.com. 86400 IN NS dc02.example.com.
_sites.example.com. 86400 IN NS dc01.example.com.
_sites.example.com. 86400 IN NS dc02.example.com.

Then on all your machines, use the BIND server as DNS server (typically handed out via DHCP option 6, domain-name-servers). For Windows Domain matters, only records below _msdcs and _sites are ever looked up.
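
With ISC DHCPd, that means pointing domain-name-servers at the BIND server in dhcpd.conf, for example (placeholder address):

option domain-name-servers 192.0.2.53;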

I initially assumed you should even be able to point your domain controllers to the BIND DNS server — they should be able to follow the NS record whenever they try to update their own records, so that the update ends up on the Microsoft DNS server. As it turns out, though, the RFC 2136 DNS UPDATE method is used when domain controllers register their own records, so you’ll see error messages in your logs if you point your domain controllers to the BIND DNS server (on a Microsoft DC, these refer to NETLOGON and dynamic DNS registrations, while on a Samba DC they come from samba_dnsupdate). If you are running Samba 4.5 or higher, you should ensure that samba_dnsupdate is called with the --use-samba-tool flag, which can probably be done by setting the option below in your /etc/samba/smb.conf. If you are running an older Samba version or any Windows Server version, you need to resort to using your domain controllers’ IP addresses as DNS servers on all domain controllers (Samba: put them into /etc/resolv.conf, Windows: set them in the network interface properties).

dns update command = /usr/sbin/samba_dnsupdate --use-samba-tool

For compatibility with Unix clients (including Mac OS X), you’ll want to add a couple of CNAME records for the SRV records:

_ldap._tcp.example.com. 86400 IN CNAME _ldap._tcp.dc._msdcs.example.com.
_gc._tcp.example.com. 86400 IN CNAME _ldap._tcp.gc._msdcs.example.com.
_kerberos._tcp.example.com. 86400 IN CNAME _kerberos._tcp.dc._msdcs.example.com.
_kerberos._udp.example.com. 86400 IN CNAME _kerberos._tcp.dc._msdcs.example.com.
_kpasswd._tcp.example.com. 3600 IN SRV 0 100 464 dc01.example.com.
_kpasswd._tcp.example.com. 3600 IN SRV 0 100 464 dc02.example.com.
_kpasswd._udp.example.com. 3600 IN SRV 0 100 464 dc01.example.com.
_kpasswd._udp.example.com. 3600 IN SRV 0 100 464 dc02.example.com.

The _kpasswd records unfortunately can’t be CNAMEs because they don’t exist in the _msdcs branch, so you manually need to keep them up-to-date when you add and remove domain controllers.

Boot a Windows install disc from the network using iPXE and wimboot

A while ago, I showed how you can use a Linux PXE server along with a tool called Serva to PXE boot a Windows installer DVD. By now, there is a much nicer solution available that doesn’t require any Windows tools: iPXE with wimboot. So go ahead and replace your PXELinux setup with iPXE first. Then copy the contents of a Windows installer DVD to your TFTP server and make sure that the folder is also shared read-only via SMB. Now copy the wimboot binary to your TFTP server and add something like the following to your iPXE config file:

set serverip 192.168.200.29
set tftpboot tftp://${serverip}/
set tftpbootpath /mnt/Daten/tftpboot

:menu
menu iPXE boot menu
item --key w win10de Windows 10 16.07 x64 German
choose os
goto ${os}

:win10de
echo Booting Windows Installer...
set root-path ${tftpboot}/ipxe
kernel ${root-path}/wimboot gui
set root-path ${tftpboot}/Win10_1607_German_x64
initrd ${root-path}/boot/bcd BCD
initrd ${root-path}/boot/boot.sdi boot.sdi
initrd ${root-path}/sources/boot.wim boot.wim
initrd ${root-path}/boot/fonts/segmono_boot.ttf segmono_boot.ttf
initrd ${root-path}/boot/fonts/segoe_slboot.ttf segoe_slboot.ttf
initrd ${root-path}/boot/fonts/segoen_slboot.ttf segoen_slboot.ttf
initrd ${root-path}/boot/fonts/wgl4_boot.ttf wgl4_boot.ttf
boot || goto failed

That’s it. When you boot this boot menu entry, you’ll be presented with the Windows installer, but if you click through it, it will at some point ask you for a driver because it can’t find its installer packages. So before you click through, hit Shift-F10 and execute the following commands to set up the network, mount the SMB share and re-execute the installer:

wpeinit
net use s: \\192.168.200.29\tftpboot\Win10_1607_German_x64 bar /user:foo
s:\sources\setup.exe

If your SMB server is running Samba, the user you specify (foo) must not exist on the server, so that the server falls back to anonymous authentication. With a Windows server, things might be different.
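
For reference, the Samba share for the installer tree could be as simple as this (a sketch based on my paths; the map to guest line belongs in the [global] section):

[global]
   map to guest = Bad User

[tftpboot]
   path = /mnt/Daten/tftpboot
   read only = yes
   guest ok = yes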

That should give you a working installer that will get your Windows running within a few minutes, because Gigabit Ethernet has much higher bandwidth than a spinning DVD or a cheap USB flash drive.

Serva did have one advantage, however: when you set it up, you could inject network drivers into the boot image. With Windows 10, luckily, that has become a non-issue: Microsoft releases a new installer ISO for it about once a year, which you can directly download and which should contain all drivers for the latest hardware available when it was released.

Wiring Fibre Channel for Arbitrated Loop

We have a small Fibre Channel SAN with three servers, a switch and a dual-controller RAID enclosure. With only a single switch, we obviously couldn’t connect all servers redundantly to the RAID system. That meant, for example, that firmware updates to it could only be applied after shutting down the servers. Buying a second switch was hard to justify for this simple setup, so we decided to hook the switch up to the first RAID controller and wire a loop off the second RAID controller. Each server would have one port connected to the switch and another one to the loop.

Back in the old days, Fibre Channel hubs existed for exactly this purpose, but nowadays you simply can’t get them anymore. However, in a redundant setup, you don’t need a hub, you can simply run cables appropriately, i.e. in a daisy-chain fashion. The only downside of not having a hub, the loop going down if one server goes down, is irrelevant here because you have a second path via the switch. For technical details on the loop topology, you can have a look at documentation available from EMC.

To wire your servers and RAID controllers in a fashion resembling a hub (only without the automatic bypass if a server goes down), you need simplex patch cables. These consist of a single fibre (instead of two, like you are used to) and have a connector that looks like half of a regular LC connector. You can get them in the same qualities as regular (duplex) patch cables and in single- and multi-mode as needed. They look like this:

These cables are somewhat exotic, so your usual cable dealer might not have them, but they exist and are available from specialized fiber cable dealers. As you are wiring in a daisy-chain fashion, you need one simplex cable per node you want to connect.

Once you have the necessary cables, you wire everything by connecting a cable from the output side of the FC port on the first device and connecting it to the input side of the FC port on the second device. The second device’s output side connects to the third device’s input side and so on, until you have made a full loop by connecting the last device’s output side to the first device’s input side, like this:

How can you tell the output side from the input side? Some transceivers have little arrows or have labels “TX” (output) and “RX” (input). If yours don’t, you can recognize the output side by the red laser light coming from it. To protect your eyes, never look directly into the laser light. Never connect two output sides together, otherwise you may damage the laser diodes in both of them. Therefore, be careful and always double-check.

That’s it, you’re done. All servers on the loop should immediately see the LUNs exported to them by the RAID controller.

In my setup however, further troubleshooting was required. The loop simply did not come up. As it turns out, someone had set the ports on one FC HBA to point-to-point-only and 2 Gbit/s mode. After switching both these settings to their default automatic mode, the loop went up and data started flowing. My HBAs are QLogic QLE2462 (4 Gbit/s generation) and QLE2562 (8 Gbit/s generation), and they automatically negotiated the fastest common speed they could handle, which is 4 Gbit/s. Configuring these parameters on the HBA usually requires hitting a key at the HBA’s pre-boot screen to enter the configuration menu and doing it there, or via vendor-specific software. I didn’t have access to the pre-boot configuration menu and was running VMware ESXi 6.0 on the servers. The QConvergeConsole for my QLogic adapters luckily is also available for VMware. It is not as easy to install as on Windows or Linux, unfortunately. You first install a CIM provider via the command line on the ESXi host, reboot the host and then install a plugin into VMware vCenter server.

Snom VoIP phones: OpenVPN and multicast audio

Snom voice-over-IP phones have a built-in OpenVPN client, and they can also have audio transmitted to them via multicast. However, in my case a phone logged into OpenVPN did not play back audio received from the local Ethernet network.

I had multicast reception configured correctly on the phone:

multicast_listen=on
mc_address=239.255.255.245:5555

(Remember that you need to push the Apply button at the bottom of the page and then the Save button at the top of the page for the setting to actually take effect.)

To diagnose the problem, you can use multicast pings. Running

ping -I eth0 -t 2 239.255.255.245

on the local network showed no reactions from the phone. However, running

ping -I tun2 -t 2 239.255.255.245

on the VPN server showed responses from the phone. So clearly the phone wanted to do multicast exclusively via VPN. The cause of this was the following line in my OpenVPN config on the server:

push "redirect-gateway def1"

The obvious solution was to replace this line with routes that excluded the multicast IP addresses:

push "route 0.0.0.0 128.0.0.0"
push "route 128.0.0.0 192.0.0.0"
push "route 192.0.0.0 224.0.0.0"

While this worked, the phone wouldn’t complete booting up the next time I power-cycled it. So I switched back to the old OpenVPN config and instead turned on multicast routing on the VPN server:

apt-get install smcroute
echo 'mgroup from eth0 group 239.255.255.245' >> /etc/smcroute.conf
echo 'mroute from eth0 group 239.255.255.245 to tun0 tun1 tun2' >> /etc/smcroute.conf
service smcroute restart

Multicast audio now works and it even goes across the VPN.

Below you can find some commands that use FFMPEG to send the multicast audio:

ffmpeg -re -i song.mp3 -filter_complex 'aresample=16000,asetnsamples=n=160' -acodec g722 -ac 1 -vn -f rtp udp://239.255.255.245:5555
ffmpeg -re -i song.mp3 -filter_complex 'aresample=8000,asetnsamples=n=160' -acodec pcm_mulaw -ac 1 -vn -f rtp udp://239.255.255.245:5555
ffmpeg -re -i song.mp3 -filter_complex 'aresample=8000,asetnsamples=n=160' -acodec pcm_alaw -ac 1 -vn -f rtp udp://239.255.255.245:5555

Interestingly, you don’t even need to set the rtp_codec_type setting — the phone automatically determines the codec from the stream. I wasn’t able to get the Opus codec working though; the phone just made crackling noises when I tried.

Triple-booting a Clamshell iBook

I have a first-generation Clamshell iBook, manufactured in January 2000. These shipped with 64 MB of on-board RAM, a 6 GB hard drive and a CD drive.
I got it in summer 2005 when my high school decommissioned these computers. Running Mac OS X 10.3.9 at the time, it was still a very usable computer, especially after I upgraded the RAM to 320 MB. I used it as my main computer for a year and a half, until I replaced it with a new Intel-based Mac.

Now, in 2016, I found a 40 GB IDE laptop hard drive and decided to put that into the Clamshell and try to make it usable again.
As usual, iFixit has a good tutorial. You have to take the entire machine apart and end up with around 50 screws, but it’s surprisingly easy. You only need a Torx T8, a Phillips #1 and a 5mm nut driver.
Now, on to putting some operating systems on the hard drive.

Installing Mac OS X 10.4

Here is what you need:

  • Mac OS X 10.4 retail install DVD
  • Mac OS X 10.3 retail install CD (the first one is sufficient). Any other bootable Mac OS X 10.3 CD might also work.
  • USB hard drive

Officially, Tiger is only supported on the latest revision of the Clamshell iBook, which has FireWire and a DVD drive. Mine only has a CD drive and does not support USB booting, which makes things exceedingly difficult.

  1. Take an image of the 10.4 install DVD and save it onto the USB hard drive using a different Mac.
  2. Boot the Clamshell off the 10.3 install CD
  3. Open up Disk Utility. Partition the disk with the following sizes and names: 16 MB linux-boot, 2 GB linux-swap, 12 GB linux, 20 GB Macintosh HD, 6 GB Install. Make sure to select the Install OS 9 drivers checkbox.
  4. Plug in the USB hard drive.
  5. Clone the 10.4 install image onto the Install partition on the internal drive.
  6. Open a Terminal and bless the hard drive: bless --folder "/Volumes/Install/System/Library/CoreServices" --bootinfo "/Volumes/Install/usr/standalone/ppc/bootx.bootinfo".
  7. Using the Terminal, pico /Volumes/Install/System/Installation/Packages/OSInstall.mpkg/Contents/OSInstall.dist and remove the PowerBook2,1 from the badMachines list.
  8. Open Startup Disk and set the iBook to boot from the Install partition.
  9. Reboot
  10. Install 10.4 onto the Macintosh HD partition, deselecting all printer drivers and languages you don’t need in order to save some space.
  11. After the reboot, run Software Update a few times.

Installing Mac OS 9.2.2

Here is what you need:

  • A Mac OS 9.2.2 CD or NetBoot image

Mac OS 9 is really easy to install: you just copy it to the hard drive.

  1. Mount the CD or image and copy System Folder and Applications (Mac OS 9) from the root directory to your Macintosh HD.
  2. Open the Startup Disk preference pane and select Mac OS 9.2.2.
  3. Reboot.
  4. If you used the NetBoot image, you will be asked for user name and password. Use NBUser and netboot.
  5. In order to disable the password prompt, go to the hard drive, System Folder and move the contents of Control Panels disabled to Control Panels. Now open the Multiple Users control panel, go to options and set it to local authentication. Then, disable multiple users entirely.
  6. Download and install QuickTime 6.0.3 for Mac.
  7. Open the Startup Disk control panel and select Mac OS 10.4.
  8. After the reboot, open the Classic preference pane and select the system folder on Macintosh HD.

Now you can use Classic and you can also do a native boot.

Installing Debian Linux 8 “Jessie”

Mac OS 9.2.2 hasn’t received updates in almost 15 years, and Mac OS X 10.4 is also 7 years beyond its update cycle. So how about a current operating system?

Here is what you need:

  • Debian 8 “Jessie” PowerPC installation CD

Except for the partitioning, the installation is quite straightforward.

  1. Insert the CD and boot it by holding the C key while turning on the computer.
  2. Proceed through the installer.
  3. When asked for the partitioning, set the 16 MB partition to be NewWorld Boot and set the bootable flag. Set the 2 GB partition to be swap. Set the 12 GB partition to be ext3 and mounted at /.
  4. When asked for the packages to install, choose the Xfce desktop. It’s lightweight enough to run on this old hardware.
  5. After rebooting, run nano /etc/apt/sources.list and add the non-free repository to all entries in the file. Then, install the AirPort firmware: apt-get install firmware-linux-nonfree.

The resulting Debian install mostly worked for me, except for two things:

  • the AirPort card does not show up in NetworkManager
  • Font rendering is a bit messed up, while bitmap graphics are drawn to the screen just fine.

I’m still looking for solutions to these two things.

Configuring the yaboot bootloader

Right now, your computer will boot into Linux all the time.
While the yaboot loader prompts you to hit x to boot into Mac OS X and m to boot into Mac OS 9, that doesn’t work as expected.
Instead of booting into Mac OS X, my iBook booted into the install partition.
And instead of booting into Mac OS 9, my iBook booted into Mac OS X.

So run nano /etc/yaboot.conf and adjust the macosx= line so that it refers to the same partition as the macos= line. Exit the editor and run ybin -v to apply the changes. That fixes the first problem.

Now, reboot into Mac OS X and open the Startup Disk preference pane. Select Mac OS 9.2.2 and reboot.
While rebooting, hold down the option key and select the Linux partition at the boot picker.
Run nano /etc/yaboot.conf again and add brokenosx to a new line. Once more, run ybin -v.

Now, each boot loader entry does what you’d expect. The trick here is that brokenosx causes yaboot to directly load the Mac OS X booter for the macosx= entry. The macos= entry, on the other hand, will still cause the blessed system folder to be booted.
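
The relevant fragment of my /etc/yaboot.conf ended up looking roughly like this (the partition number is a placeholder and depends on your partition map):

macos=/dev/hda11
macosx=/dev/hda11
brokenosx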

Fixing Mac OS 9 after installing Linux

One problem is still there: When you try to boot Mac OS 9, you are greeted by a blinking floppy with a question mark. This happens because the Debian partitioner destroys the Mac OS 9 drivers for the HFS+ partitions. However, the drivers can be reinstalled.

You need:

  • Mac OS 9 install CD. Any other bootable Mac OS 9 CD might also work.

Now,

  1. Insert the CD and reboot the iBook while holding down the C key.
  2. Open Drive Setup, highlight the internal hard drive, go to the Functions menu and click Update Drivers.
  3. Reboot

Resetting the Startup Disk

Whenever you touch the Startup Disk preference pane in Mac OS X or the Startup Disk control panel in Mac OS 9, your system will no longer show the yaboot prompt when you turn it on. To fix that, do the following:

  1. Hold down the option key while booting and select the Linux partition
  2. Run ybin -v

Fixing Linux problems

http://ppcluddite.blogspot.de/2012/03/installing-debian-linux-on-ppc-part-iv.html is an excellent article that explains how to fix most issues that Debian has on PowerPC. For example, to fix the font rendering troubles, create an xorg.conf file (switch to a text terminal, run init 3, Xorg -configure, cp /root/xorg.conf.new /etc/X11/xorg.conf) and insert Option "RenderAccel" "false" into its Device section.
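
The resulting Device section in /etc/X11/xorg.conf then looks something like this (keep the Identifier and Driver lines that Xorg -configure generated and just add the Option):

Section "Device"
	Identifier "Card0"
	# Driver line as generated by Xorg -configure
	Option "RenderAccel" "false"
EndSection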

ARP and multicast packets lost with OpenVPN in tap mode

After upgrading our OpenVPN server VM from Debian 7 to Debian 8 (moving us from OpenVPN 2.2 to OpenVPN 2.3 and from Linux kernel 3.2 to 3.16), upgrading our virtualization from VMware ESXi 5.5 to ESXi 6.0 and moving the VM to a different host, the VPN got really unreliable: the VPN connection itself worked fine, but connections made across the VPN took very long to get established. Once they were established, everything worked fine and you could even create new connections to the same host across the VPN, and those would be established quickly.

I wasn’t sure which one of the many changes caused the issue, but luckily Wireshark quickly revealed the problem: As we are using OpenVPN in layer 2 mode (i.e. with tap interfaces), ARP packets are quite important. While I could see the ARP requests making it across the interface bridge from tap0 to eth0, I saw the ARP replies going into eth0 and not making it to tap0. The server-side fix is easy: disable the bridge’s MAC address table completely (by setting its ageing time to zero) so that it simply lets all packets pass:

brctl setageing br0 0
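
To make this setting survive reboots, one option is a post-up hook in /etc/network/interfaces, assuming the bridge is configured there (a sketch; your bridge stanza will differ):

auto br0
iface br0 inet manual
	bridge_ports eth0
	post-up brctl setageing br0 0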

Now that ARP was working, I noticed that VPN clients also did not get IPv6 addresses. Evidently, the ICMPv6 multicasts weren’t making it across the bridge either. To fix that, enable the multicast querier on the bridge:

echo 1 > /sys/devices/virtual/net/br0/bridge/multicast_querier

Update March 2016: A recent kernel update in Debian Jessie appears to have changed the multicast bridging behavior. I now need to disable the multicast querier:

echo 0 > /sys/devices/virtual/net/br0/bridge/multicast_querier

Fixing OpenMPI over InfiniBand on Rocks Cluster Linux

We recently got a new small compute cluster at the university, running Rocks Clusters Linux 6.1.1, a CentOS 6 derivative. The nodes are interconnected via an InfiniBand network. Unfortunately, the default configuration of OpenMPI 1.6.2 in the HPC roll wastes a significant amount of performance: it communicates using TCP, which is run over a load-balanced combination of IP over InfiniBand and IP over Ethernet.

Switching to RDMA over InfiniBand is simple: just run the following command on all compute nodes and the head node:

sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh

Now however, you get a message like this when you run an MPI job:

--------------------------------------------------------------------------
WARNING: It appears that your OpenFabrics subsystem is configured to only
allow registering part of your physical memory.  This can cause MPI jobs to
run with erratic performance, hang, and/or crash.

This may be caused by your OpenFabrics vendor limiting the amount of
physical memory that can be registered.  You should investigate the
relevant Linux kernel module parameters that control how much physical
memory can be registered, and increase them to allow registering all
physical memory on your machine.

See this Open MPI FAQ item for more information on these Linux kernel module
parameters:

    http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages

  Local host:              bee.icp.uni-stuttgart.de
  Registerable memory:     32768 MiB
  Total memory:            130967 MiB

Your MPI job will continue, but may be behave poorly and/or hang.
--------------------------------------------------------------------------

To fix that, run

echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf

on all nodes and reboot. log_mtts_per_seg defaulted to 3 on our kernel and did not need tweaking. To check your current values, run

grep . /sys/module/mlx4_core/parameters/*mtt*
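
For reference, the registerable memory is roughly 2^log_num_mtt × 2^log_mtts_per_seg × page size (the formula from the Open MPI FAQ linked in the warning). With log_num_mtt=24, log_mtts_per_seg=3 and 4 KiB pages that comes out to 512 GiB, comfortably more than the 128 GiB of RAM in these nodes, whereas the old limit of 32768 MiB corresponds to log_num_mtt=20:

# 2^24 MTT entries * 2^3 MTTs per segment * 4 KiB pages = 512 GiB
echo $(( (1 << 24) * (1 << 3) * 4096 / 1024 / 1024 / 1024 ))    # prints 512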

One warning message that still comes up when running an MPI job is the following:

--------------------------------------------------------------------------
WARNING: Failed to open "OpenIB-cma-1" [DAT_INVALID_ADDRESS:]. 
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--------------------------------------------------------------------------
bee.icp.uni-stuttgart.de:30104:  open_hca: getaddr_netdev ERROR: No such device. Is ib1 configured?
bee.icp.uni-stuttgart.de:30104:  open_hca: device mthca0 not found
bee.icp.uni-stuttgart.de:30104:  open_hca: device mthca0 not found
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory
DAT: library load failure: libdaplscm.so.2: cannot open shared object file: No such file or directory

As UDAPL is removed in newer OpenMPI versions anyway, this is fixed by running

echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf

on all compute nodes and the head node.

So all in all, you can simply add the following lines to /export/rocks/install/site-profiles/6.1.1/nodes/extend-compute.xml and rebuild your compute node image:

echo "btl = ^udapl" >> /opt/openmpi/etc/openmpi-mca-params.conf
sed -i 's/add rocks-openmpi/add rocks-openmpi_ib/g' /etc/profile.d/rocks-hpc.*sh
echo "options mlx4_core log_num_mtt=24" >> /etc/modprobe.d/mlx4.conf
dracut -f /boot/initramfs-2.6.32-504.16.2.el6.x86_64.img 2.6.32-504.16.2.el6.x86_64 # may need to rebuild the initrd so it picks up the modprobe parameters