Adaptec Madness (or why I started this blog)
One of the less maintained machines in our data center is a small backup server. It's basically sitting there, storing all those backups nobody really uses and slowly filling up its disk array. Once in a while somebody checks it manually to see if everything's in order (it has never really been connected to anything, since it sits on its own isolated VLAN).
The server uses an Adaptec 5805 controller, a product line I tried debugging and fixing a long time ago (see also http://lists.freebsd.org/pipermail/freebsd-bugs/2009-June/035612.html). Unfortunately, back then our supplier stated that this was the only controller certified for this machine - so I gave in, hoping that a pretty much bored backup server wouldn't have too much trouble keeping up with the little load it would get.
I was wrong.
The original setup of this server was wasteful, since at the time FreeBSD offered no easy way of booting from ZFS and the controller offered no working JBOD support. So I ended up creating 4 x RAID1 in the controller BIOS, putting a ridiculously large set of traditional UFS2 partitions on one 1TB RAID1 and striping ZFS across the other 3 x 1TB, essentially forming a RAID10 container for storing backups. This is of course more than suboptimal, but there was little time and less motivation.
All of that was almost forgotten when the server recently started behaving erratically, becoming slow and sluggish. Checking the log files, I realized that the controller was causing the same trouble as its smaller brother (Adaptec 5405) about three years ago:
Apr 8 07:15:35 backup kernel: aac0: COMMAND 0xfffffffe80232610 TIMEOUT AFTER 32 SECONDS
Apr 8 07:15:55 backup kernel: aac0: COMMAND 0xfffffffe80234fa0 TIMEOUT AFTER 94 SECONDS
Apr 8 07:15:55 backup kernel: aac0: COMMAND 0xfffffffe80232610 TIMEOUT AFTER 52 SECONDS
Apr 8 07:16:15 backup kernel: aac0: COMMAND 0xfffffffe80234fa0 TIMEOUT AFTER 114 SECONDS
Apr 8 07:16:15 backup kernel: aac0: COMMAND 0xfffffffe80232610 TIMEOUT AFTER 72 SECONDS
And finally the controller stopped running altogether.
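For anyone staring at a similar wall of messages: the timeout lines above repeat the same command addresses with growing durations, so it helps to boil them down to "which command has been stuck for how long". The following is only a minimal sketch, assuming the log lines look exactly like the ones quoted above; the function name summarize_timeouts is made up for illustration and is not part of any FreeBSD tool.

```python
import re
from collections import defaultdict

# Matches the message format quoted above, e.g.
#   aac0: COMMAND 0xfffffffe80232610 TIMEOUT AFTER 32 SECONDS
# Assumption: other driver versions may word this differently.
TIMEOUT_RE = re.compile(
    r"(?P<dev>aac\d+): COMMAND (?P<cmd>0x[0-9a-fA-F]+) "
    r"TIMEOUT AFTER (?P<secs>\d+) SECONDS"
)


def summarize_timeouts(text):
    """Return {(device, command address): longest reported timeout in seconds}."""
    worst = defaultdict(int)
    for match in TIMEOUT_RE.finditer(text):
        key = (match.group("dev"), match.group("cmd"))
        worst[key] = max(worst[key], int(match.group("secs")))
    return dict(worst)
```

Feeding it the contents of the system log (for example the text of /var/log/messages) yields one entry per stuck command, which makes it obvious that a handful of requests never complete rather than many different ones timing out briefly.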
After power cycling the server I installed sysutils/smartmontools to figure out what was wrong. I realized that one of the hard drives was flaky (huge number of sector reallocations) and needed to be replaced ASAP. Funnily enough, this was what caused the controller to stop functioning. As an additional test I tried pulling and reinserting the drive while the system was running, to understand what was going on - only to learn that this would crash the controller as well.
To make sure this was not something that had already been fixed by the engineers at Adaptec, I installed the latest firmware for that controller and also for the Super Micro motherboard - all to no avail.
At this point it was pretty clear that dealing with this controller was pointless - so I started looking for a replacement that would offer all the features required for powering a FreeBSD/ZFS server. It was also clear that the setup of the machine really needed to be redone, so that the system's health etc. would be monitored properly.
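Since nobody looks at this box regularly, the reallocation count is exactly the kind of number worth checking automatically. Below is a rough sketch of how that could look, assuming the usual attribute table printed by smartctl -A (ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE). The function names are hypothetical, and the device argument is a placeholder: how a drive behind a RAID controller has to be addressed depends on the controller and driver, so that part is an assumption you will need to adapt.

```python
import subprocess


def reallocated_sectors(smartctl_output):
    """Return the raw Reallocated_Sector_Ct value, or None if it is not found."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least ten columns; the raw value is the tenth.
        if len(fields) >= 10 and fields[1] == "Reallocated_Sector_Ct":
            try:
                return int(fields[9])
            except ValueError:
                return None  # raw value in a format this sketch does not handle
    return None


def check_drive(device):
    """Run 'smartctl -A' on a device path (placeholder) and parse the result."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes for warnings as well
    )
    return reallocated_sectors(result.stdout)
```

Keeping the parsing separate from the smartctl call means the parser can be tried against saved output without touching any hardware. A value that keeps growing between runs is the signal to replace the drive before the controller gets upset about it.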
And I decided that (once everything had been fixed) I'd log all the little dirty tasks involved - like cross-flashing disk controllers - to make sure there would be proper documentation next time. Since sharing is caring and making somebody's life somewhere a little bit easier is a Good Thing(tm), I'd do it in the form of a public blog.
So here it is, my first blog. Not sure if it leads anywhere, but at least I'll be able to tell my grandchildren that I had one of those before publishing on the internet became illegal.
-- Michael