Xen 32bit kernel bug causing VM crashes (emergency downtime needed)


Note added 13:33:

The newer kernel wouldn’t boot as it requires a newer version of Xen. The problem we are experiencing is reported as possibly affecting only the 32bit OS. I am going to reinstall abbott with the 64bit version and see if I can sync DRBD and run the VMs as 32bit DomUs.

In the meantime costello, which has also been upgraded to Lenny, is running the remaining VMs from abbott. We will see whether it experiences the same problems.

Original post:

Last night between 9:00 and 10:00pm the VM www5-perl, which lives on the Xen hypervisor abbott, showed a spike of activity that was mirrored on sabre (the database server). Shortly after 10pm the VM crashed unexpectedly. This forced a resync of the underlying block replication pairs, and the VM was eventually brought back online by the Ganeti watcher cron job around midnight.

Abbott was only recently reinstalled with Debian lenny and exhibited issues on its first night, resulting in a reboot. No log recorded the reason, but it was likely a kernel panic. I have found some evidence that the crash, and possibly the reboot, is down to a known bug in the 32bit Xen-patched Linux kernel. The recommended fix is to upgrade to the backported kernel from Debian Squeeze.

In order to test this safely I will need to reboot the server. To ensure that the VMs it hosts do not experience extended downtime, I will first migrate them away to other nodes in the cluster. This will result in a 1 minute outage for each VM. I will then reboot abbott and test the new kernel. If it seems to be compatible with our configuration I will migrate the VMs back. I will need to have the original VMs running on abbott with the new kernel in order to know that it has solved the problem, since with quieter VMs we might not see the problem reoccur.

So to recap: I will migrate the following VMs away from abbott now:

  • bcfg2-prod.ilrt.bris.ac.uk
  • cos-dep.ilrt.bris.ac.uk
  • pkg-dx86.ilrt.bris.ac.uk
  • www5-perl.ilrt.bris.ac.uk
  • www7-jdm.ilrt.bris.ac.uk
  • www10-rtilrt.ilrt.bris.ac.uk
  • www17-perl-demo.ilrt.bris.ac.uk
  • www18-php-demo.ilrt.bris.ac.uk
  • www20-php.ilrt.bris.ac.uk

I’ll then install the new kernel and reboot. If successful I will migrate the VMs back.
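As a rough illustration of the migration step, the sketch below builds one Ganeti command per VM without running anything. It assumes the move is done with `gnt-instance migrate` (or `gnt-instance failover`, selectable via the `action` argument); the post does not say which subcommand is actually used. The function name `build_move_commands` and the `ABBOTT_VMS` list are made up for this example.

```python
# Sketch only: prints the commands rather than executing them.
# Assumption: VMs are moved with "gnt-instance migrate" or
# "gnt-instance failover"; the post does not state which.

# VMs listed in the post as hosted on abbott.
ABBOTT_VMS = [
    "bcfg2-prod.ilrt.bris.ac.uk",
    "cos-dep.ilrt.bris.ac.uk",
    "pkg-dx86.ilrt.bris.ac.uk",
    "www5-perl.ilrt.bris.ac.uk",
    "www7-jdm.ilrt.bris.ac.uk",
    "www10-rtilrt.ilrt.bris.ac.uk",
    "www17-perl-demo.ilrt.bris.ac.uk",
    "www18-php-demo.ilrt.bris.ac.uk",
    "www20-php.ilrt.bris.ac.uk",
]


def build_move_commands(instances, action="migrate"):
    """Return one gnt-instance command (as an argv list) per instance."""
    if action not in ("migrate", "failover"):
        raise ValueError("action must be 'migrate' or 'failover'")
    return [["gnt-instance", action, name] for name in instances]


if __name__ == "__main__":
    # Review the printed commands before running them by hand on the master node.
    for argv in build_move_commands(ABBOTT_VMS):
        print(" ".join(argv))
```

Printing the commands first makes it easy to run them one at a time and check each VM before moving on to the next. The same list can be reused for the move back once abbott has been rebooted.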

I want to do this as soon as possible to leave time to resolve the problem before the weekend. I’d rather have 1-2 minutes of downtime for each server than potentially hours’ worth over the weekend.