Xen random reboots (possibly) resolved


Well, I think we’ve found the problem. When I migrated europa, our primary DNS server, to a Xen domain in order to consolidate hardware, the Xen host started to reset and reboot randomly. At the same time there was a recorded increase in /var/log/messages of ICMP type 8 (ping) packets being rejected. To diagnose the problem, a serial terminal session was attached and its output was recorded to a log file. Last week, when Kieren and Paul reported downtime on DNS, the log captured a kernel panic (oops). The main fault was in a symbol called ‘csum_partial’. Dan Foster suggested that there may be a relationship between csum and the unnecessarily enabled TCP segmentation offload setting on a part of the network stack in the Xen host architecture. This triggered a memory of a previous problem.

A while ago I had a problem with encrypted connections from bcfg2 clients on servers using the Intel e1000 network module in Debian etch. The reconnection problem in bcfg2 was patched, and the local fix was to turn off receive and transmit checksumming on the external physical network interface. Checksumming is turned off with the same tool that turns off TCP segmentation offloading. Since we disabled rx/tx checksumming we have not experienced any further reboots.
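For reference, here is a minimal sketch of what that command looks like. The post does not name the tool; on Linux this is normally done with ethtool, so that is assumed here, as are the interface name eth0 and the feature names rx, tx and tso (ethtool's short names for receive checksumming, transmit checksumming and TCP segmentation offload). The helper only builds the command line, it does not run it.

```python
def offload_off_command(interface, features=("rx", "tx", "tso")):
    """Build the argv for an ethtool call that turns the given offloads off.

    Assumption: ethtool is the tool in use; the interface name is hypothetical.
    """
    argv = ["ethtool", "-K", interface]
    for feature in features:
        # ethtool -K takes "<feature> on|off" pairs after the device name
        argv += [feature, "off"]
    return argv


if __name__ == "__main__":
    # Show the command as it would be typed in a root shell.
    print(" ".join(offload_off_command("eth0")))
```

The resulting list could be handed to subprocess.run() as root, or simply typed by hand; either way the setting is lost on reboot unless it is also kept in the network configuration.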

This may also be the solution for the random reboots experienced with abbott. The command that disables checksumming is kept in the network configuration file on these servers so that it is applied to the interface on every reboot. The command was present on costello but not on himalia or leda and, more interestingly, it was present on abbott but (due to a typo on my part) was using incorrect syntax. I have corrected and applied the configuration setting.
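A typo like the one on abbott is easy to miss by eye, so a small check can help. The sketch below assumes the setting is stored as a `post-up ethtool -K ...` line in Debian's /etc/network/interfaces; the exact form used on these servers is not shown in this post, and the set of accepted feature names is likewise an assumption.

```python
import shlex

# Assumed set of feature names we expect to see; extend as needed.
VALID_FEATURES = {"rx", "tx", "tso"}


def check_offload_line(line):
    """Return a list of problems found in one 'post-up ethtool -K ...' line."""
    words = shlex.split(line)
    if words[:3] != ["post-up", "ethtool", "-K"]:
        return ["not a 'post-up ethtool -K' line"]
    if len(words) < 4:
        return ["missing interface name"]
    settings = words[4:]  # everything after the interface name
    if not settings or len(settings) % 2:
        return ["features and values must come in pairs"]
    problems = []
    for feature, value in zip(settings[0::2], settings[1::2]):
        if feature not in VALID_FEATURES:
            problems.append(f"unknown feature {feature!r}")
        if value not in ("on", "off"):
            problems.append(f"bad value {value!r} for {feature}")
    return problems
```

An empty list means the line looks well formed. Something like `post-up ethtool -K eth0 rx tx off` is reported because the features and values do not pair up.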

(Touch wood) we have not experienced a reboot on any server since this change. :-)

Thanks to Dan for his help on this.

I will now make a point of adding this command to the system startup scripts for Xen hosts as a matter of policy. I would also like to spend some time converting the network configuration files, which are managed by bcfg2, from flat files to dynamic templated ones, so that this policy can be applied more consistently.
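To give an idea of what a templated configuration could look like, here is a rough sketch in plain Python. It is not bcfg2's own template mechanism, only a stand-in for the idea. The stanza layout, the use of DHCP, the interface name and the is_xen_host flag are all hypothetical.

```python
from string import Template

# Hypothetical Debian-style interface stanza; the offload line is filled in
# only for hosts that need it.
STANZA = Template(
    "auto $iface\n"
    "iface $iface inet dhcp\n"
    "$offload_line"
)


def render_stanza(iface, is_xen_host):
    """Render the stanza, adding the offload setting for Xen hosts only."""
    offload = ""
    if is_xen_host:
        # Assumed command form: ethtool -K <iface> <feature> off ...
        offload = f"    post-up ethtool -K {iface} rx off tx off tso off\n"
    return STANZA.substitute(iface=iface, offload_line=offload)


if __name__ == "__main__":
    print(render_stanza("eth0", is_xen_host=True), end="")
```

With the decision made in one place, a host that is marked as a Xen host gets the setting automatically, rather than relying on someone remembering to paste the line into each flat file.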

