I apologise for the downtime Sunday evening. What follows is a description of the problems I ran into which caused this.
It was about 6pm. J- and I were trying to figure out some issues we had been experiencing with XMPP. I run ejabberd in a VM on my server, which I’m reasonably happy with. J- on the other hand was using a Google Talk account, but always appeared invisible on my contact list. Yet, I was clearly visible and online on her roster.
My suspicions were that it was somehow related to Google Talk – it’s been in the news that Google is breaking federation, and they have broken it (partially at least) in the past. J- sought to fix this by signing up for a dukgo.com account. Oddly, this resulted in the same strange issue.
Next, I thought I might want to investigate my own XMPP server. I was only running stock Debian Squeeze, so figured I should probably upgrade to the latest stable before spending any significant amount of time on it. After all, how long could an upgrade take? It was 6:30pm on a Sunday evening, but I also had slides to come up with for a talk at LUV Tuesday night. Surely the upgrade wouldn’t take more than about an hour?
After all the packages had been upgraded, it was time to reboot the instance into a new kernel. That’s when I ran into my first problem – the instance refused to boot. It seemed that pygrub, which is what I was using for a boot-loader, was unable to parse the newly generated grub.cfg file.
Pygrub is a part of my dom0, which also was running Squeeze. My thinking was that hopefully if I upgraded the dom0 to Wheezy too, it will support the new Grub configuration format. Worth a shot. And so I began the dom0 upgrade.
After all the packages on the dom0 were upgraded, it was now time to reboot and cross my fingers. Thankfully, the reboot was successful. I was very glad to see the processes of runlevel 2 initiate. Very glad… except one of my instances refused to boot. Not just any instance, but my firewall! No more Internets! Panic started to settle in.
The ADSL modem connected to the server via USB. The entire USB controller was using xen-pciback for device pass-through to to the guest. This functionality was no longer working – the dom0 decided that the device was no longer available and could not be passed through. If it could not be passed through, the firewall instance refused to start (and wouldn’t be very useful even if it did). This was starting to be a real annoyance. Now I had to unload the kernel modules, play with /sys entries to free up the device, and then boot the firewall again. There was some tinkering with dom0’s Grub kernel parameters along the way, but eventually I got the firewall to boot *and* see the USB device. It took hours, but I finally did it. Sorta.
There were a ton of USB driver error messages in dmesg output of the firewall. The USB stack was failing and was unusable. I tried various pass-through configurations, but ultimately I was not able to get the guest to use any kind of USB device. Seems like some kind of regression.
At this point it was getting quite late, and I wasn’t in the mood for playing around any longer. I just wanted things working again – and preferably without having to undo all my work by restoring from backups. Fine, I thought. If I can’t pass through the USB controller, I’ll just install a spare PCIe NIC and pass through that instead. After all, my modem supports connectivity from either USB or Ethernet, and it doesn’t matter to me which.
Although this seemed like a good approach, and I had the hardware to spare, things once again didn’t work out. The dom0 kernel wanted to load the device drivers of this hardware for itself, and I would have to prevent that if I were to be able to use that in the guest. The kernel driver module was r8169. I started creating entries in /etc/modprobe.d/ and rebuilding the initramfs, which is when it hit me… this is the same kernel module as used by the other integrated network port in the server – which I very much need. If I prevent this from loading, I won’t be able to remotely connect to the server any more via my LAN!
It was somewhere in the early hours of Monday morning, I had no Internet access (except through tethering with my N900), I had to go to work the same day, I had not had much sleep the night before, I had slides for a presentation that needed to be created, and I knew J- would kill me if I left the server in this broken state for too long. Further, I wasn’t sure how to proceed, and (to add insult to injury) my N900 battery just died.
I checked the server, and observed that it had two unused PCI slots. Thankfully my home server runs on an old budget motherboard that still supported them, as I figured I could scrounge up an old PCI NIC or two. After pulling some old boxes out of storage, I did indeed find spare PCI NICs. The first one I tried required yet another r8169 kernel module, but then I found an old PCI NIC that was gigabit and had heatsinks on it! I couldn’t see what it was under the heatsinks, but given that the other chipsets were bare, it seemed it would probably be something different. Turned out to be some kind of National Semiconductor NIC. No idea where I brought it from or how long I have had it for, but it proves that sometimes it really does pay to keep old crap. 🙂
So, after installing it, messing around a bit with /etc/modprobe.d/ rebuilding initramfs, tinkering with the dom0 kernel parameters to provide appropriate device-specific xen-pciback parameters (because I’d forget about them if they weren’t in /proc/cmdline), changing the firewall VM configuration profile, etc… my Internets were back.
Unfortunately, even as I write this I still have not had time to go back and investigate the original issue – J- is still invisible to me in my roster when she should appear as online.