Tonight around 7:30, our router and most of our servers crashed. The router rebooted, but only one of the Intel servers came back up on its own. The Sparc servers all survived the event.
I suspect a power hit. I tried to call Isomedia tech support but only got voice mail with a full mailbox, so I was not able to find out what happened tonight.
I rebooted and brought all the servers back online. Along the way I discovered configuration errors in nsswitch.conf on one server and fstab on another that prevented them from fully coming back on their own, and corrected both.
The outage today was the result of IsoFusion (formerly IsoMedia) working on the flaky jack in our cabinet. They moved it and it went out altogether. Turned out there was a bad jack AND a bad cable. Both have been replaced and the connection is now solid.
The Ubuntu update is still in progress.
It is taking longer than it should have, first because it wanted more memory than it had, and second because in my attempt to fix that I actually read the manual for virsh, which said that the default unit was kilobytes. Not thinking, I thought 4096 KB = 4 GB, when actually 4096 MB = 4 GB, so you can probably deduce what happened.
So I exploded it trying to make modern Linux run in 4 MB (ain’t gonna happen) and had to run dpkg --configure -a. That blew up because of some missing dependencies, which have been resolved, so it’s running now but is going to take a while.
A reboot will still be necessary when it is all done and in the meantime there are probably some things not working.
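For anyone who wants to double-check the arithmetic before resizing a guest, here is a minimal sketch of the unit conversion behind the virsh mix-up described above. It assumes only what the manual said, that a bare number is read as kilobytes (taken here as blocks of 1024 bytes). The function names are made up for this illustration and are not part of virsh or libvirt.

```python
# Assumption: a bare memory value is interpreted as KiB (1024-byte blocks).
KIB_PER_MIB = 1024
KIB_PER_GIB = 1024 * 1024


def gib_to_kib(gib: int) -> int:
    """Return how many KiB are needed to express `gib` GiB."""
    return gib * KIB_PER_GIB


def describe_kib(kib: int) -> str:
    """Render a KiB count in GiB, MiB, or KiB, whichever fits."""
    if kib >= KIB_PER_GIB:
        return f"{kib / KIB_PER_GIB:g} GiB"
    if kib >= KIB_PER_MIB:
        return f"{kib / KIB_PER_MIB:g} MiB"
    return f"{kib} KiB"


if __name__ == "__main__":
    # What a bare 4096 actually asks for.
    print("4096 means:", describe_kib(4096))
    # The number that should have been entered for 4 GiB.
    print("4 GiB needs:", gib_to_kib(4))
```

Running it shows that 4096 works out to 4 MiB, and that 4 GiB needs 4194304 when the unit is kilobytes.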
I am taking debian.eskimo.com down for about an hour to move it from one host to another, less occupied host. The only person currently logged in has been idle for 25 hours. If you need a Debian-based host in the meantime, Mint and Ubuntu are presently available.
An upgrade of ubuntu.eskimo.com from Vivid Vervet to Wily Werewolf is underway. When it completes, a reboot will be necessary to activate the new kernel.
Tonight I’ll be installing a shelf and an upgraded server in the co-location cabinet. Hopefully this will not disrupt any existing services. I’ve got the wiring strapped down now so the chances of accidentally unplugging something are significantly reduced.
I do have to move the connection to the co-lo provider’s router to a different jack, which hopefully will eliminate problems with an existing flaky jack. This will interrupt service briefly.
When the new server is operational, I will be moving some of the existing guest machines to it. This will cause some interruptions, but these are mostly little-used shell servers like opensuse, so it should not be too disruptive.
Since I will be at the co-location facility and not here, I won’t be able to answer the phone live. If you do encounter a problem please either use the Support -> Tickets ticket system or leave a voice mail.
On Friday evening I will be going to the co-location facility to install a cabinet shelf and upgraded server.
While I am there I will also be moving the Ethernet connection that feeds our cabinet to a different jack, which hopefully will eliminate a marginal connection with the existing jack.
The latter work will require a brief interruption in network connectivity.
Our incoming mail servers got very clogged up. There were a number of contributing factors. When I recently moved them I didn’t allocate enough RAM for them to operate efficiently. CentOS recently pushed out a broken update for fail2ban and so a lot of crap that is normally blocked was hitting them. A virus was released in the wild and circulated fairly well before the clam-av folks got a signature for it in their database. There was no bounce time limit configured.
All of these things have been addressed and now they’re cleared out and functioning normally.
This is a posting for those who administer CentOS 6 based systems, with the hope that it will save you some grief.
There is a bug in the current version of fail2ban being distributed in the CentOS 6.7 repositories. It will not create iptables entries.
The cause is that the current version of fail2ban passes the -w (locking) flag, which the version of iptables used does not support.
The fix is to edit /etc/fail2ban/action.d/iptables-common.conf and remove the -w flag.
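If you have several machines to patch, the edit above can be scripted. The following is only a sketch: it strips a standalone -w token from non-comment lines of whatever text it is given. The SAMPLE text is an assumption about what the relevant lines might look like, not a copy of the shipped file, and the function name is made up for this illustration. Check your own copy of the file and keep a backup before changing anything.

```python
import re

# Matches "-w" only when it stands alone as a token, so something like
# "--wait" is left untouched.
WAIT_FLAG = re.compile(r"(?<!\S)-w(?!\S)")


def drop_wait_flag(config_text: str) -> str:
    """Remove standalone -w tokens from non-comment lines."""
    fixed = []
    for line in config_text.splitlines():
        if not line.lstrip().startswith("#"):
            # Drop the flag and any whitespace left dangling at line end.
            line = WAIT_FLAG.sub("", line).rstrip()
        fixed.append(line)
    trailing_newline = "\n" if config_text.endswith("\n") else ""
    return "\n".join(fixed) + trailing_newline


# Hypothetical sample input; the real file contents may differ.
SAMPLE = (
    "# -w is not understood by older iptables\n"
    "lockingopt = -w\n"
    "iptables = iptables <lockingopt>\n"
)

if __name__ == "__main__":
    print(drop_wait_flag(SAMPLE), end="")
```

To use it for real, read /etc/fail2ban/action.d/iptables-common.conf, pass the text through drop_wait_flag, write the result back, and restart fail2ban. Comment lines are deliberately left alone so any notes in the file that mention the flag are preserved.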
My apologies to everyone on shellx this evening. I accidentally powered it off when I meant to halt a different machine that I was experimenting with NetBSD on.