Author Archives: Michael Kuron

AMD Ryzen Threadripper: NUMA architecture, CPU affinity and HTCondor

At the university lab I work at, we usually get desktop computers with Intel Core i7 Extreme CPUs. They have more memory controllers and more cores than Intel’s mainstream CPUs, which is great for us because we do software development and simulations on these machines. Now that it was time to order some new machines, we decided to check out what AMD offers. The last time AMD’s Athlon series was competitive with Intel in the high-end desktop market was probably in the days before the Core i series was introduced a decade ago. Even in the server/HPC market, AMD’s Opteron series hadn’t been a serious Intel competitor after 2012. Now with the Ryzen, AMD finally has something that beats Intel’s high-end offerings both in absolute price and in price per performance.

Today’s high-end CPU market

As prices on Intel’s high-end chips seem to have been increasing with every generation and after Intel didn’t handle the Spectre/Meltdown disaster very well at all earlier this year, we decided that it really was time to break Intel dominance in our lab. I would actually have liked to get something with a non-x86 architecture (because why not), but the requirement of eight CPU cores and four DDR4 memory channels ruled out pretty much everything. The remaining ones were either eliminated based on price (the IBM POWER9 CPU as in the Talos II) or because they are not available in the market (ARM in the form of the Cavium ThunderX2 or Qualcomm Centriq). The POWER9 and ThunderX2 actually have eight memory channels (the Talos’s mainboard only provides access to four though), as does the AMD Epyc (the server version of the Threadripper), so they or their successors might still become interesting options in the future. For comparison, Intel’s current Xeon Scalable series only has six channels, as does the Centriq.

The AMD Ryzen family

We decided to order a Threadripper 1950X with 16 cores. When I unpacked it and started my first benchmark simulation, I was pretty disappointed by the performance though. It turned out that the Threadripper is a NUMA architecture, but you need to toggle a BIOS option (set Memory Interleave to Channel) before it actually presents itself as such to the operating system. In its default mode, memory latencies are very high because a process might be running on a CPU core that is using memory on the other pair of memory controllers. The topology it presents in NUMA mode is roughly like this: out of the 16 cores, four cores each share a common L3 cache to make what AMD calls a core complex (CCX). Two CCXes together share a two-channel memory controller and sit on the same die. Two dies are interconnected with a 50 GB/s link. In NUMA mode and after setting the correct CPU affinity on my simulation processes (which is what the remaining sections of this blog post will be about), I was getting the expected performance — and as you might have expected, the Threadripper really is powerful. You can get something comparable from Intel, the Core i9 Extreme, but these chips are extreme not only in name but also in price. Also, I love the simplicity of AMD’s product lineup: the Ryzen series has two memory channels and 4-8 cores (i.e. one CCX), the Ryzen Threadripper has four memory channels and 8-16 cores (two CCX) and the Epyc has eight memory channels and up to 32 cores. Don’t bother with the second-generation Threadripper that were just announced: the 2920X is almost identical to the 1920X and the 2950X to the 1950X. The 2970WX and 2990WX are really odd chips: they have four CCXes (like the Epyc), but two of them don’t have their own memory controllers. So half of these chips would be as fast as my 1950X in NUMA mode, and the other half would be slower than my 1950X in its default mode.

CPU affinity for MPI

Usually, you only have to worry about CPU pinning and process affinity on servers, high-performance compute clusters and some workstations. Since AMD introduced the Opteron and Intel introduced the Core architecture, machines with multiple processor sockets have been associating memory controllers with CPUs in such a way that memory access within a socket was fast and between sockets was slower. This is called NUMA. To minimize inter-socket memory accesses, you need to tell your operating system’s scheduler to never move threads or processes between cores. This is called pinning or affinity. On servers, it is usually taken care of by the application, while on HPC clusters the admin configures the MPI library to do the right thing. The Threadripper is probably the first chip that brings NUMA to the desktop market. 

If you are using OpenMPI, you just need to set two environment variables

export OMPI_MCA_hwloc_base_binding_policy=numa
export OMPI_MCA_rmaps_base_mapping_policy=numa

so that processes are pinned (bound) to a NUMA domain (usually that’s a socket, but in the Threadripper it’s a die with two CCXes). The second variable tells OpenMPI to create (map) processes on alternating NUMA domains — so the first one is put on the first die, the second one on the second die, the third one on the first die, and so on.

You can also use the -bind-to core and -map-by core command-line arguments to mpiexec/mpirun to achieve the same. Using l3cache  instead of numa may actually get you some even better performance because latencies within a CCX are lower than between CCXes within a die, but that only gains you a few percent and requires that you use more processes and less threads, which may be suboptimal for some software. Below is an interesting figure about latency between cores, as measured with this tool (I’m not sure about the yellow squares though — I would have expected them to be more turquoise, and they don’t seem to match what I observe in actual performance).

If you are using Intel MPI (some commercial software we use does that), you only need

export I_MPI_PIN_DOMAIN=numa

to get pinning. The process creation already happens on alternating domains by default.

I like to put these variables in a file in /etc/profile.d so that they are automatically set for everyone who logs into these machines. I didn’t know this before, but both the GNU and the Intel OpenMP library will only create one thread for each core from their affinity mask, so you don’t even need to set OMP_NUM_THREADS=8 manually if you are running hybrid-parallelized codes.

CPU affinity with HTCondor

Our machines run 24 hours a day, 365 days a year. When nobody is using them locally or via SSH, they run simulation jobs via the HTCondor job scheduler. Of course, we want to make good use of the resources with these jobs too, so we need CPU pinning as well. Usually, we set

SLOT_TYPE_1 = cpus=100%

in our Condor configuration so that people can decide themselves how many CPUs they want for their jobs. For the Threadripper, it doesn’t make sense to use more than 8 cores per job because we don’t want jobs to cross NUMA domains. This means we need two slots with 50% of the CPU cores, and we want to set SLOT<n>_CPU_AFFINITY to pin the processes to the die. I wrote a Jinja2 template that creates this Condor configuration:

{%- set nodes = salt.file.find('/sys/devices/system/node', name='node[0-9]*', type='d') -%}
NUM_SLOTS = {{ nodes | count }}

{% for node in nodes -%}
NUM_SLOTS_TYPE_{{ loop.index }} = 1
{%- set cpus = salt.file.find(node, name='cpu[0-9]*', type='d') -%}
{%- set physical_cpus = [] -%}
{% for cpu in cpus %}
{%- set cpu_id = cpu.replace(node + '/cpu', '') -%}
{%- set siblings = + '/topology/thread_siblings_list').strip().split(',') -%}
{% if cpu_id == siblings[0] %}
{%- do physical_cpus.append(cpu) -%}
{% endif -%}
{% endfor %}
SLOT_TYPE_{{ loop.index }} = cpus={{ physical_cpus | count }}
SLOT_TYPE_{{ loop.index }}_PARTITIONABLE = true
{%- set cpu_ids = [] -%}
{% for cpu in cpus %}
{%- do cpu_ids.append(cpu.replace(node + '/cpu', '')) -%}
{% endfor %}
SLOT{{ loop.index }}_CPU_AFFINITY = {{ cpu_ids | join(',')}}

{% endfor -%}

If you don’t have a Jinja2-based configuration manager like SaltStack, you can use the following Python script to render the template:

import os, sys
from jinja2 import Environment, FileSystemLoader
import salt
import salt.modules.file

salt.file = salt.modules.file

file_loader = FileSystemLoader(os.path.dirname(__file__))
env = Environment(loader=file_loader, extensions=[''])
template = env.get_template(sys.argv[1])
template.globals['salt'] = salt
print (template.render())

This produces


SLOT_TYPE_1 = cpus=8
SLOT1_CPU_AFFINITY = 0,1,16,17,18,19,2,20,21,22,23,3,4,5,6,7

SLOT_TYPE_2 = cpus=8
SLOT2_CPU_AFFINITY = 10,11,12,13,14,15,24,25,26,27,28,29,30,31,8,9

One additional complication is that some of our users like to set getenv = True in their condor submit file. This means that the variables we set in the global shell profile in the previous section are inherited to the Condor job, overriding the affinity we set for the slots. So we need a job wrapper script that removes these variables if present. Put


for var in $(/usr/bin/env | /bin/grep -E '^(OMPI_|I_MPI_)' | /usr/bin/awk -F = {print $1}'); do
    echo "Removing environment variable $var" >&2
    unset "$var"

exec "$@"
echo "Failed to exec($error): $@" > $_CONDOR_WRAPPER_ERROR_FILE
exit 1

into /usr/lib/condor/libexec/ and add

USER_JOB_WRAPPER = /usr/lib/condor/libexec/

to your Condor configuration.

Update 2019-01: Updated the Jinja template to work correctly on non-16-core Threadrippers. These have one (12-core, 24-core) or two (8-core) cores disabled on each CCX, which results in non-consecutive core IDs.

Groundwire push notifications with private SIP server

Acrobits’ Groundwire is a SIP soft-phone app for iOS. To receive incoming calls while the app is inactive, it either runs in the background (which drains the battery pretty quickly) or uses push notifications. The latter requires you to entrust Acrobits’ servers with your login credentials and only works for SIP servers reachable from the internet, so it is is not an option for private/internal SIP servers.

However, Acrobits offers a push notification HTTP API. This article tells you how to use it for this purpose.

First, you need the app to report its push token to your server. A PHP script like this one can accept this information and store it into your Asterisk server’s database:

$key = $_POST['username'];
$key = filter_var($key, FILTER_SANITIZE_NUMBER_INT);
$value = $_POST['token'];
if ($value == '')
 header('HTTP/1.0 422 Unprocessable Entity', TRUE, 422);
 die('No token provided');
shell_exec("sudo asterisk -rx 'database put GroundwireToken " . escapeshellarg($key) . " " . escapeshellarg($value) . "'");

$value = $_POST['selector'];
shell_exec("sudo asterisk -rx 'database put GroundwireSelector " . escapeshellarg($key) . " " . escapeshellarg($value) . "'");

$value = $_POST['appid'];
shell_exec("sudo asterisk -rx 'database put GroundwireApp " . escapeshellarg($key) . " " . escapeshellarg($value) . "'");

In Groundwire, go to Settings, Accounts, edit your account, Advanced Settings, Web Services, Push Token Reporter. For URL, enter the full URL of the above script. For POST, enter

Second, you need to modify your dialplan to send a push notification that wakes up the app. I created a subroutine for that purpose:

exten => _XX,1,GotoIf(${DB_EXISTS(GroundwireToken/${EXTEN})}]?groundwire)
same => n,Return()
same => n(groundwire),Set(push_data=verb=NotifyTextMessage&Selector=${URIENCODE(${DB(GroundwireSelector/${EXTEN})})}&DeviceToken=${URIENCODE(${DB(GroundwireToken/${EXTEN})})}&AppId=${URIENCODE(${DB(GroundwireApp/${EXTEN})})})
same => n,Noop(Groundwire: ${CURL("",${push_data})})
same => n,Ringing()
same => n,Wait(3) ; give app some time to activate
same => n(done),Return()
exten => i,1,Return()
exten => s,1,Return()

Then, in your dialplan, call that subroutine, like so:

exten => _XX,Noop(Incoming call for ${EXTEN})
same => n,Gosub(sub-push,${EXTEN},1)
same => n,Dial(SIP/${EXTEN})

Because there is no (documented) way to tell the app to dial a number (e.g. one that does PickupChan(SIP/${CALLERID(num)})) when it receives the push notification, we are simply waiting for three seconds for the app to launch and register and then proceed signalling the call.

Filtering outgoing traffic with VirtualBox’s NAT interface

Hypervisors like VMware Workstation, VMWare Fusion or VirtualBox usually offer three kinds of network interfaces: bridged (to a network on the host), NAT (sharing an IP address with the host via network address translation) and host-only (a connection exclusively between host and guest).

VMWare, at least on Linux, realizes NAT entirely in the kernel, using standard IP forwarding and setting up a DHCP server on the host that gives addresses to the guest. VirtualBox, on the other hand, handles NAT entirely in user-space, meaning all packets entering and leaving the VM really seem to be going to and from a process named VBoxHeadless or similar.

If you want to limit what kinds of connections a guest system can make, VMware lets you do that quite easily by adding rules to the FORWARD chain:

sudo iptables -I FORWARD -i vmnet8 -j ACCEPT -d
sudo iptables -I FORWARD -i vmnet8 -j REJECT

Unfortunately, things are a lot more difficult with VirtualBox. Since there is no dedicated interface, you need to filter based on process. iptables can’t do that directly, but you can use cgroups to add the necessary marks to the packets:

sudo mkdir /sys/fs/cgroup/net_cls/virtualbox
echo 86 | sudo tee /sys/fs/cgroup/net_cls/virtualbox/net_cls.classid
sudo iptables -N VIRTUALBOX
sudo iptables -I OUTPUT -j VIRTUALBOX -m cgroup --cgroup 86
sudo iptables -I VIRTUALBOX -j ACCEPT -d
sudo iptables -I VIRTUALBOX -j REJECT

Of course, you now need to create the VirtualBox process inside that cgroup. One way is

sudo apt-get install cgroup-tools
sudo cgexec -g net_cls:virtualbox vboxmanage startvm WindowsXP --type=headless

or you can use a wrapper script instead of vboxmanage:

#!/bin/sh -e

if [ ! -d /sys/fs/cgroup/net_cls/$CGROUP_NAME/ ]; then
  mkdir /sys/fs/cgroup/net_cls/$CGROUP_NAME
  echo $CGROUP_ID > /sys/fs/cgroup/net_cls/$CGROUP_NAME/net_cls.classid

/bin/echo $$ > /sys/fs/cgroup/net_cls/$CGROUP_NAME/tasks

exec /usr/bin/vboxmanage "$@"

What to do if GitHub only sends emails for a random subset of notifications

For the past few months, it seemed like GitHub wasn’t sending me notification emails for all activity in my watched repositories. I was only getting a random subset of these emails, at most half of what it should have been. I eventually contacted GitHub’s support and they were immediately able to help me. They told me that they send a certain portion of their notifications not directly, but via a third-party service (judging from the message headers, it’s SendGrid).

This third-party service had apprently added me to their suppression list because one email they sent to me months ago had been hard-bounced. There probably was some malfunction on our email server at the time that caused this. I understand why SendGrid does this, but silently discarding any emails GitHub asked them to deliver to me is bad. Really, they should have notified GitHub, which then should have either emailed me at one of the other addresses in my profile or shown a big banner after I log in the next time, asking me to re-confirm my email address to clear the block.

So if this ever happens to you and you’re getting fewer GitHub notification emails than you should be: just email their support and ask them to check. I should have done this sooner, but it really wasn’t clear to me whom to blame (it could just as well have been our mail server’s fault).

MacBook Pro with Thunderbolt 3 and Dell docking stations

I recently got a new MacBook Pro with Thunderbolt 3 ports. These ports are great because they can carry power, USB and output to an external display simultaneously. Not a lot of accessories are available on the market yet though besides simple adapters from USB-C to things like USB, HDMI, VGA, Thunderbolt 2, or Ethernet. Belkin has announced the Belkin Thunderbolt 3 Express Dock HD, but it seems like it will be priced upwards of $300. Dell’s XPS 13 and 15 series however has included Thunderbolt 3 for a year now and Dell makes some nice docking stations for them:

Dell DA200: small portable device with Gigabit Ethernet, USB 3, VGA and HDMI.

Dell WD15: stationary device with a size similar to a paperback book, with a 130W or 180W power brick. Over the DA200, it adds mini-DisplayPort, 2x USB2, 2x USB3, Speaker and Headset outputs. It also passes power through to the computer.

Dell TB15: This model appears to have been recalled because it didn’t run stable and replaced with the TB16.

Dell TB16: stationary device shaped like a stack of a dozen CD cases, with a 180W or 240W power brick. Over the WD15, it adds DisplayPort. Note that it connects to the computer via Thunderbolt 3 instead of USB-C, which means it can deliver higher screen resolutions (see below).

First of all, the DA200 just works. VGA and HDMI run up to a resolution of 2048×1152 pixels at 60 Hz (even though Dell says it only supports 1920×1080), Ethernet works without installing a driver (it contains the same Realtek chip, PCI ID 0bda:8153, that the official Belkin USB-C adapter uses). HDCP-encrypted HDMI works fine, as confirmed by starting a Netflix video.

Next up, the WD15. I first hooked it up via mini-DisplayPort and was disappointed to find that it only runs up to 2048×1152 pixels. Switching to HDMI alleviated the problem and my screen ran at its native 2560×1440 pixels — though the colors were all messed up because the computer was outputting YCrCb while the screen was interpreting that as RGB, but that is easily fixed by creating an EDID override. HDCP works just fine — even though Dell says the dock doesn’t support it. Dell also says that the dock can do 4K resolutions (3840×2160 pixel) only at 30 Hz, so if you want to hook up a 4K display, don’t get this dock. The dock has a power button, but it didn’t surprise me to find out that it didn’t power up the Macbook. The Macbook did, however, automatically turn on when plugging in the USB-C cable, even in clamshell mode. One oddity I found was that the front left USB 3 port would only deliver power, not data, while a device was plugged into the rear USB 3 port — while I did not find this documented anywhere, it seems likely that this is a hardware limitation and not a Mac issue — see below for more information. Audio works fine too (it’s a Realtek chip, PCI ID 0bda:4014), but you need to use Audio MIDI Setup and click the “Configure Speakers” to assign left/right either to the first stream (headphone output on front) or the second stream (audio output on the back):


Finally, the Dell TB16. I didn’t have this one for testing because the WD15 suffices for my application. Dell says that you can run two 4K displays or one 5K (5120×2880 pixel) display, all at 60 Hz. I assume everything else will work just as it does with the WD15. For USB purposes however, I believe this dock contains a PCIe-attached USB 3.1 chip as the USB-C Alternate Mode Partner Matrix shows that a superspeed USB signal cannot be carried if high-resolution video is transferred. According to one of my readers, it doesn’t work out of the box unfortunately.

USB 3 problems

It seems like I can’t reliably use all USB 3 ports on the WD15 simultaneously. Sometimes they all work, but after the next reboot or unplug/replug cycle, once ceases to work. I had initially believed this to be a hardware limitation, but it appears to be a bug in Apple’s USB 3 drivers. sudo dmesg shows messages like these:

000069.247426 IOUSBHostHIDDevice: IOUSBHostHIDDevice::interruptRetry: resetting device due to IO failures

000069.551400 AppleUSB20Hub@14440000: AppleUSBHub::deviceRequest: resetting due to persistent errors

So the OS shuts down the USB hub. If you google these messages, you’ll find a large number of reports of it occurring with all kinds of USB hubs, but no solutions. So Apple just needs to get their drivers fixed.

Scientific Article: Moving charged particles in lattice Boltzmann-based electrokinetics

I’ve published a scientific article in The Journal of Chemical Physics.

Moving charged particles in lattice Boltzmann-based electrokinetics
Michael Kuron, Georg Rempfer, Florian Schornbaum, Martin Bauer, Christian Godenschwager, Christian Holm, and Joost de Graaf
J. Chem. Phys. 145, 214102 (2016)

The journal provides open-access to the article for the first month. After that, you’ll still be able to download it for free from arXiv:

Disabling “secured” IPv6 addresses is macOS 10.12 Sierra

On older macOS versions, every network interface would have one IPv6 address autogenerated from its MAC address, easily identified by the characteristic “ff:fe” bytes in the middle of the host part:
$ ifconfig en0
ether 10:dd:b1:9f:6b:ba
inet6 fe80::12dd:b1ff:fe9f:6bba%en0 prefixlen 64 scopeid 0x4
inet6 2001:7c0:2012:4a:12dd:b1ff:fe9f:6bba prefixlen 64 autoconf

Since macOS 10.12 however, these were replaced with randomly-generated “secured” addresses:
$ ifconfig en0
ether 10:dd:b1:9b:d0:67
inet6 fe80::46:3b36:146:9857%en0 prefixlen 64 secured scopeid 0x4
inet6 2001:7c0:2012:4a:4e6:f1d1:dd90:c6b4 prefixlen 64 autoconf secured

Very little is known about these, besides a single mailing list post that discovered them. If you are running a server, you’ll want your IPv6 address to be deterministic so you can register it in DNS. Therefore, we need to revert to pre-10.12 behavior:

$ echo net.inet6.send.opmode=0 >> /etc/sysctl.conf
$ reboot

If you look at the source code of the XNU kernel (Search for the IN6_IFF_SECURED flag) and the IPConfiguration service in macOS 10.11 (the 10.12 source code hasn’t been released yet), you can see that the new behavior was already there, just not enabled by default like it is now. Also, we now know that the change wasn’t made to reflect RFC 7217 (Semantically Opaque Interface Identifiers) behavior, but rather implements RFC 3972 (Cryptographically Generated Addresses).

Using a BIND DNS server in an Active Directory Environment

Years ago, I posted a script that allowed ISC DHCPd to update a Microsoft DNS server with dynamic records for DHCP clients. I haven’t used that method in a long time and there is a much simpler method: use ISC DHCPd together with the BIND DNS server like everybody else does, and only delegate the _mscds and _sites zones from the BIND server to the Microsoft DNS servers: 86400 IN NS 86400 IN NS 86400 IN NS 86400 IN NS

Then on all your machines, use the BIND server as DNS server (typically set via DHCP option 23). For Windows Domain matters, only records below _msdcs and _sites are ever looked up.

I believe you should even be able to point your domain controllers to the BIND DNS server — they should be able to follow the NS record so that whenever they try to update their own records, they do so on the Microsoft DNS server. As it turns out, the RFC 2136 DNS UPDATE method is used when domain controllers try to register their own records, so you’ll see error messages in your logs if you point your domain controllers to the BIND DNS server (on a Microsoft DC, these would refer to NETLOGON and dynamic DNS registrations, while on a Samba DC they would be about samba_dnsupdate). If you are running Samba 4.5 or higher, you should ensure that samba_dnsupdate is called with the –use-samba-tool flag, which can probably be done by setting the option below in your /etc/samba/smb.conf. If you are running an older Samba version or any Windows Server version, you need to resort to using your domain controllers’ IP addresses as DNS servers on on all domain controllers (Samba: put them into /etc/resolv.conf, Windows: set them in the network interface properties).

dns update command = /usr/sbin/samba_dnsupdate --use-samba-tool

For compatibility with Unix clients (including Mac OS X), you’ll want to add a couple of CNAME records for the SRV records: 86400 IN CNAME 86400 IN CNAME 86400 IN CNAME 86400 IN CNAME 3600 IN SRV 0 100 464 3600 IN SRV 0 100 464 3600 IN SRV 0 100 464 3600 IN SRV 0 100 464

The _kpasswd records unfortunately can’t be CNAMEs because they don’t exist in the _msdcs branch, so you manually need to keep them up-to-date when you add and remove domain controllers.

Boot a Windows install disc from the network using iPXE and wimboot

A while ago, I showed how you can use a Linux PXE server along with a tool called Serva to PXE boot a Windows installer DVD. By now, there is a much nicer solution available that doesn’t require any Windows tools: iPXE with wimboot. So go ahead and replace your PXELinux setup with iPXE first. Then copy the contents of a Windows installer DVD to your TFTP server and make sure that the folder is also shared read-only via SMB. Now copy the wimboot binary to your TFTP server and add something like the following to your iPXE config file:
set serverip
set tftpboot tftp://${serverip}/
set tftpbootpath /mnt/Daten/tftpboot

menu iPXE boot menu
item --key w win10de Windows 10 16.07 x64 German
choose os
goto ${os}

echo Booting Windows Installer...
set root-path ${tftpboot}/ipxe
kernel ${root-path}/wimboot gui
set root-path ${tftpboot}/Win10_1607_German_x64
initrd ${root-path}/boot/bcd BCD
initrd ${root-path}/boot/boot.sdi boot.sdi
initrd ${root-path}/sources/boot.wim boot.wim
initrd ${root-path}/boot/fonts/segmono_boot.ttf segmono_boot.ttf
initrd ${root-path}/boot/fonts/segoe_slboot.ttf segoe_slboot.ttf
initrd ${root-path}/boot/fonts/segoen_slboot.ttf segoen_slboot.ttf
initrd ${root-path}/boot/fonts/wgl4_boot.ttf wgl4_boot.ttf
boot || goto failed

That’s it. When you boot this boot menu entry, you’ll be presented with the Windows installer, but if you click through it, it will at some point ask you for a driver because it can’t find its installer packages. So before you click through, hit Shift-F10 and execute the following commands to set up the network, mount the SMB share and re-execute the installer:
net use s: \\\tftpboot\Win10_1607_German_x64 bar /user:foo

If your SMB server is running Samba, the user you specify (foo) must not exist on the server so you force it to use anonymous authentication. With a Windows server, things might be different.

That should give you a working installer that will get your Windows running within a few minutes because Gigabit Ethernet has a much bigger bandwidth than a spinning DVD or a cheap USB flash drive.

Serva did have one advantage, however: when you set it up, you could inject network drivers into the boot image. With Windows 10, luckily, that has become a non-issue: Microsoft releases a new installer ISO for it about once a year, which you can directly download and which should contain all drivers for the latest hardware available when it was released.

Wiring Fibre Channel for Arbitrated Loop

We have a small Fibre Channel SAN with three servers, a switch and a dual-controller RAID enclosure. With only a single switch, we obviously couldn’t connect all servers redundantly to the RAID system. That meant, for example, that firmware updates to it could only be applied after shutting down the servers. Buying a second switch was hard to justify for this simple setup, so we decided to hook the switch up to the first RAID controller and wire a loop off the second RAID controller. Each server would have one port connected to the switch and another one to the loop.

Back in the old days, Fibre Channel hubs existed for exactly this purpose, but nowadays you simply can’t get them anymore. However, in a redundant setup, you don’t need a hub, you can simply run cables appropriately, i.e. in a daisy-chain fashion. The only downside of not having a hub, the loop going down if one server goes down, is irrelevant here because you have a second path via the switch. For technical details on the loop topology, you can have a look at documentation available from EMC.

To wire your servers and RAID controllers in a fashion resembling a hub (only without the automatic bypass if a server goes down), you need simplex patch cables. These consist of a single fibre (instead of two, like you are used to) and have a connector that looks like half of a regular LC connector. You can get them in the same qualities as regular (duplex) patch cables and in single- and multi-mode as needed. They look like this:

These cables are somewhat exotic, so your usual cable dealer might not have them, but they exist and are available from specialized fiber cable dealers. As you are wiring in a daisy-chain fashion, you need one simplex cable per node you want to connect.

Once you have the necessary cables, you wire everything by connecting a cable from the output side of the FC port on the first device and connecting it to the input side of the FC port on the second device. The second device’s output side connects to the third device’s input side and so on, until you have made a full loop by connecting the last device’s output side to the first device’s input side, like this:

How can you tell the output side from the input side? Some transceivers have little arrows or have labels “TX” (output) and “RX” (input). If yours don’t, you can recognize the output side by the red laser light coming from it. To protect your eyes, never look directly into the laser light. Never connect two output sides together, otherwise you may damage the laser diodes in both of them. Therefore, be careful and always double-check.

That’s it, you’re done. All servers on the loop should immediately see the LUNs exported to the by the RAID controller.

In my setup however, further troubleshooting was required. The loop simply did not come up. As it turns out, someone had set the ports on one FC HBA to point-to-point-only and 2 Gbit/s mode. After switching both these settings to their default automatic mode, the loop went up and data started flowing. My HBAs are QLogic QLE2462 (4 Gbit/s generation) and QLE2562 (8 Gbit/s generation), and they automatically negotiated the fastest common speed they could handle, which is 4 Gbit/s. Configuring these parameters on the HBA usually requires hitting a key at the HBA’s pre-boot screen to enter the configuration menu and doing it there, or via vendor-specific software. I didn’t have access to the pre-boot configuration menu and was running VMware ESXi 6.0 on the servers. The QConvergeConsole for my QLogic adapters luckily is also available for VMware. It is not as easy to install as on Windows or Linux, unfortunately. You first install a CIM provider via the command line on the ESXi host, reboot the host and then install a plugin into VMware vCenter server.