Mail Server Down

     I’m not sure whether it hung going down or coming up, but the mail server did not reboot properly.  It responds to pings, but I cannot connect to any services, so I am heading down to the co-location facility to resolve it.  This will take about 45 minutes.

Server Reboots

     I will be interrupting services briefly tonight to reboot the physical hosts and a number of the guest machines for kernel upgrades.  These interruptions should be relatively short in duration, around ten minutes.

Ticket System Upgraded

     osTicket has been upgraded from 1.10 to 1.10.1.  These are not the versions I thought we were on, but 1.10.1 is the most recent stable version.

     They’ve apparently moved things away from open source, and GitHub now has old versions rather than the current versions.  I am not happy with this move, as it gives the community less opportunity to contribute to its development.

Tickets

     Please refrain from creating trouble tickets until further notice this evening.  In the meantime, please either e-mail support@eskimo.com or call 206-812-0051.

     I will be upgrading osTicket this evening.  osTicket has completely changed database schemas, so I will not be able to carry over old tickets.  However, right now there are no open tickets, which makes it an ideal time to do this update.

All Services Moved Off of Failing Hardware

     All services have been moved off of failing hardware.  There should be no further unscheduled interruptions.

     Over the next few days there will be some interruptions of service, lasting about 20-30 minutes per service, between 10pm and 6am.  These are needed to make backups of the newly configured machines, so that if there is a failure, restoration returns to the current configuration.

Sick Host Machine

     I think the motherboard is damaged in one of our hosts, the one that holds the mail spool, mail.eskimo.com, mx2.eskimo.com, debian.eskimo.com, and scientific7.eskimo.com.

     In addition to random reboots, the machine is sometimes reporting disk errors, but the SMART status does not show any issue with the drives and no errors are recorded.  That leaves the disk controllers, which are on the motherboard.

     So tonight, various of the services mentioned above will be down for a period of time as I move them off this failing machine so that I can take it out of service and replace the motherboard.

     When the BIOS lost its fan settings, which shut down a chassis fan, the machine got quite warm.  It’s hard to say whether the heat damaged it or whether existing damage caused the BIOS to lose its settings.

     At any rate, I am moving things off so I can take it out of service for several days to replace the motherboard and then burn it in properly (extensive testing with mprime while monitoring temperatures, etc.).  This is both to make sure it is stable and to find the minimum voltage at which the CPU will operate stably.  The lower the voltage, the less the heat and the longer the life.

Fedora Upgraded to FC27

     Fedora.eskimo.com is now upgraded to Fedora 27.  Not all of the previously installed applications are there, because I did a clean install rather than an upgrade.  The previous two upgrades had not gone cleanly and left the RPM database less than pristine.

     If any missing applications are important to you, please e-mail support@eskimo.com or open a ticket, and I will prioritize getting those installed for you.

Server Still Unstable – More Maintenance Tonight

     Tonight I am going to take down the machine that houses the mail spool and the debian and scientific7 virtual machines in order to run some diagnostics.

     The machine spontaneously rebooted again today.  What makes this so difficult to troubleshoot is that it is not generating errors in between reboots.

     I am concerned that the problem with the BIOS the other day either damaged components through overheating, or that the motherboard was already damaged and that this was the root cause of the BIOS losing its fan settings.

     I am going to run some diagnostic software to see whether there is some marginal hardware, and I am also going to try removing a Linux option that overwrites the processor’s firmware.  This module caused problems on another of my machines, so it may just be bad software.  However, there are two identical machines in terms of hardware and only one is having problems.