For compatibility with a new guest OS, I upgraded my ESXi host to 5.5 today. On reboot, it crashes after a few seconds (the yellow ESXi boot screen briefly flashes a message about starting up PCI passthrough). The purple screen of death (PSOD) I get looks like this:
VMware ESXi 5.5.0 [Releasebuild-1474528 x86_64]
#PF Exception 14 in world 32797:helper1-2 IP 0x4180046f7319 addr 0x410818781760
PTEs:0x10011e023;0x1080d7063;0x0;
cr0=0x8001003d cr2=0x410818781760 cr3=0xb6cd0000 cr4=0x216c
frame=0x41238075dd60 ip=0x4180046f7319 err=0 rflags=0x10206
rax=0x410818781760 rbx=0x41238075deb0 rcx=0x0
rdx=0x0 rbp=0x41238075de50 rsi=0x41238075deb0
rdi=0x1878176 r8=0x0 r9=0x2
r10=0x417fc47b9528 r11=0x41238075df10 r12=0x1878176
r13=0x1878176000 r14=0x41089f07a400 r15=0x6
*PCPU2:32797/helper1-2
PCPU 0: SSSHS
Code start: 0x418004600000 VMK uptime: 0:00:00:05.201
0x41238075de50:[0x4180046f7319]BackMap_Lookup@vmkernel#nover+0x35 stack: 0xffffffff00000000
0x41238075df00:[0x418004669483]IOMMUDoReportFault@vmkernel#nover+0x133 stack: 0x60000010
0x41238075df30:[0x418004669667]IOMMUProcessFaults@vmkernel#nover+0x1f stack: 0x0
0x41238075dfd0:[0x418004660f8a]helpFunc@vmkernel#nover+0x6b6 stack: 0x0
0x41238075dff0:[0x418004853372]CpuSched_StartWorld@vmkernel#nover+0xf1 stack: 0x0
base fs=0x0 gs=0x418040800000 Kgs=0x0
When rebooting the machine now, it reverts to my previous version, ESXi 5.1-914609.
A bit of playing around revealed: This only happens if I am connected to the Intel AMT VNC server. If I connect after ESXi has booted up, it crashes a fraction of a second after I connect to VNC. Go figure! Apparently it’s not such a good idea to have a VNC server inside the GPU, Intel…
Before I figured this out, I booted up the old ESXi 5.1.0-914609 and even upgraded it to ESXi 5.1.0-1483097. Looking at dmesg revealed loads of weird errors while connected to the VNC server:
2014-02-13T11:23:15.145Z cpu0:3980)WARNING: IOMMUIntel: 2351: IOMMU Unit #0: R/W=R, Device 00:02.0 Faulting addr = 0x3f9bd6a000 Fault Reason = 0x0c -> Reserved fields set in PTE actively set for Read or Write.
2014-02-13T11:23:15.145Z cpu0:3980)WARNING: IOMMUIntel: 2371: IOMMU context entry dump for 00:02.0 Ctx-Hi = 0x101 Ctx-Lo = 0x10d681003
lspci | grep '00:02.0 ' shows that this is the integrated Intel GPU (which I'm obviously not doing PCI passthrough on).
So:
- ESXi 5.5 panics when using Intel AMT VNC
- ESXi 5.1 handles Intel AMT VNC semi-gracefully and only spams the kernel log with dozens of messages per second
- ESXi 5.0 worked fine (if I remember correctly)
I have no idea what VMware is doing there. From all I can tell, out-of-band management like Intel AMT should be completely invisible to the OS.
Note that this is on a Sandy Bridge generation machine with an Intel C206 chipset and a Xeon E3-1225. The Q67 chipset is almost identical to the C206, so I expect the problem to occur there as well. Newer chipsets hopefully behave better, and perhaps newer firmware versions help too.
Update November 2014: I just upgraded to the latest version, ESXi 5.5u2-2143827, and it’s working again. I still get the dmesg spam, but the PSODs are gone. These are the kernel messages I’m seeing now while connected via Intel AMT VNC:
2014-11-29T11:17:25.516Z cpu0:32796)WARNING: IOMMUIntel: 2493: IOMMU context entry dump for 0000:00:02.0 Ctx-Hi = 0x101 Ctx-Lo = 0x10ec22001
2014-11-29T11:17:25.516Z cpu0:32796)WARNING: IOMMU: 1652: IOMMU Fault detected for 0000:00:02.0 (unnamed) IOaddr: 0x5dc5aa000 Mask: 0xc Domain: 0x41089f1eb400
2014-11-29T11:17:25.516Z cpu0:32796)WARNING: IOMMUIntel: 2436: DMAR Fault IOMMU Unit #0: R/W=R, Device 0000:00:02.0 Faulting addr = 0x5dc5aa000 Fault Reason = 0x0c -> Reserved fields set in PTE actively set for Read or Write.
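If you want to check whether your own machine is affected, or how bad the spam is, a small script can tally the fault lines per device. This is only a rough sketch: the regular expression is written against the two fault-line formats quoted in this post (with and without the leading 0000: PCI domain), and the function name count_iommu_faults is my own invention, not anything VMware ships. Other ESXi builds may word these messages differently.

```python
import re
from collections import Counter

# Assumption: fault lines look like the ones quoted in this post, e.g.
#   "... Device 00:02.0 Faulting addr = 0x... Fault Reason = 0x0c ..."
#   "... Device 0000:00:02.0 Faulting addr = 0x... Fault Reason = 0x0c ..."
FAULT_RE = re.compile(
    r"Device\s+(?:[0-9a-fA-F]{4}:)?"                    # optional PCI domain
    r"(?P<bdf>[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])\s+"  # bus:device.function
    r"Faulting addr\s*=\s*(?P<addr>0x[0-9a-fA-F]+)\s+"
    r"Fault Reason\s*=\s*(?P<reason>0x[0-9a-fA-F]+)"
)


def count_iommu_faults(text):
    """Count fault lines, keyed by (bus:device.function, fault reason)."""
    counts = Counter()
    for match in FAULT_RE.finditer(text):
        key = (match.group("bdf"), int(match.group("reason"), 16))
        counts[key] += 1
    return counts


# Small demo using one of the log lines from this post.
sample = (
    "2014-11-29T11:17:25.516Z cpu0:32796)WARNING: IOMMUIntel: 2436: "
    "DMAR Fault IOMMU Unit #0: R/W=R, Device 0000:00:02.0 "
    "Faulting addr = 0x5dc5aa000 Fault Reason = 0x0c -> "
    "Reserved fields set in PTE actively set for Read or Write."
)
for (bdf, reason), n in count_iommu_faults(sample).most_common():
    print(f"{bdf} reason=0x{reason:02x} count={n}")
```

To use it on real data, read a copy of the kernel log into a string and pass it to count_iommu_faults. I would run this on a workstation rather than on the ESXi host itself, since the f-strings need a reasonably recent Python 3.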
So basically, Intel AMT VNC is now usable again.
Update August 2015: ESXi 6.0 still spams the logs, no change over ESXi 5.5.