Server Still Unstable – More Maintenance Tonight

     Tonight I am going to take down the machine that houses the mail spool and the debian and scientific7 virtual machines to run some diagnostics.

     The machine spontaneously rebooted again today.  What makes this so difficult to troubleshoot is that it is not generating any errors between reboots.

     I am concerned that the problem with the BIOS the other day either caused damage by overheating components, or that the motherboard was already damaged, which may have been the root cause of the BIOS losing its fan settings.

     I am going to run some diagnostic software to see whether there is some marginal hardware, and I am also going to try removing a Linux option that overwrites the processor’s firmware.  This module caused problems on another of my machines, so it may just be bad software; however, there are two machines with identical hardware and only one is having problems.

Power and Phones down

 

      I am without power and phones at my home office, but if you call and leave a message it will be automatically transcribed and I can pick it up on my tablet.  Alternatively, you can create a ticket.

Maintenance Complete

     It would be the understatement of the year, or at least the week, to say that the work did not go as anticipated.  The machine I was going to use as a basis for comparison also had an old BIOS, and given that it was the same as the machine that kept losing its fan profile, I decided it was best to upgrade it as well.

     When I upgraded the BIOS it returned to the default settings, which meant that I had to determine good settings from scratch.  While I was doing this, another machine crashed.

     I don’t really know what caused the crashes, but I found a number of problems.  For one thing, VT-d was not enabled but should have been, since we have kvm/qemu virtual machines on these boxes.  Not having it enabled isn’t a show stopper, but it causes Linux to do more work to perform I/O in a virtual machine.
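
     As a rough illustration of how one might confirm from inside Linux that the IOMMU is actually active after changing the firmware setting, the sketch below counts the entries under /sys/kernel/iommu_groups.  It rests on assumptions: the function name is made up for this example, and an empty result can mean either that VT-d is off in the BIOS or that the kernel was not told to use it (for example, no intel_iommu=on on the command line), so it is a hint rather than a definitive test.

```python
import os


def iommu_group_count(root="/sys/kernel/iommu_groups"):
    """Return how many IOMMU groups the kernel exposes under `root`.

    Hypothetical helper for illustration.  Returns 0 when the directory
    is missing or empty, which usually means the IOMMU is not in use.
    """
    try:
        # Each active IOMMU group appears as a numbered subdirectory.
        return sum(1 for name in os.listdir(root) if name.isdigit())
    except FileNotFoundError:
        return 0


if __name__ == "__main__":
    count = iommu_group_count()
    if count:
        print(f"IOMMU appears active: {count} group(s) found")
    else:
        print("No IOMMU groups found; check the firmware setting and kernel options")
```

     The directory is passed in as a parameter so the same function can be pointed at a copy of the tree when testing.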

     Then I found an issue with an item that allows full 64-bit address decoding for I/O devices.  For some reason this breaks the built-in graphics of the i7-6700K, causing instability.

     I now have all the machines on a current BIOS and ran some torture tests on the CPUs, which ran clean.  So we’ll see how it goes.  I had to slow down the CPUs slightly on two machines to get them to complete the torture tests without error.

Mail, Debian, Virtual Private Servers

     Virtual private servers will go down briefly so I can compare BIOS settings between a machine that is stable and the mail server, which presently is not.  When I updated the BIOS to resolve a fan issue, it reset all settings to default.  I tried to set them back as best I could from memory, but I suspect I didn’t do the best job, so I will be attempting to correct that tonight.

     The virtual private servers should only be down a few minutes; the mail server will be down somewhat longer, as I will be running memory diagnostics to eliminate the possibility of a bad DIMM.  It may take several hours of intermittent downtime to resolve these issues.

Server Still Unstable

    The host machine that houses the mail spool, mail server virtual machine, and debian virtual machine spontaneously rebooted today even though temperatures are correct.  So I am going to take it down around midnight tonight to run some diagnostics and make some adjustments.

 

Mail Server Maintenance Completed

     The BIOS had corrupted the Q-fan profiles, turned one chassis fan off entirely, and put the others in a mode where they were just barely running, so the machine overheated.  A BIOS upgrade fixed the fan issues.
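
     A stopped or barely turning fan of this kind can sometimes be caught from the operating system before the machine overheats.  The sketch below reads the fan*_input files that the Linux hwmon interface exposes under /sys/class/hwmon and flags anything below a threshold.  The function names and the 500 RPM threshold are assumptions made for this example, and whether the motherboard’s fan sensors show up at all depends on the sensor driver being loaded.

```python
import glob
import os


def read_fan_rpms(root="/sys/class/hwmon"):
    """Return {sysfs path: RPM} for every readable fan sensor under `root`."""
    readings = {}
    pattern = os.path.join(root, "hwmon*", "fan*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as fh:
                readings[path] = int(fh.read().strip())
        except (OSError, ValueError):
            # Skip sensors that cannot be read or return non-numeric data.
            continue
    return readings


def slow_fans(readings, min_rpm=500):
    """Return the subset of readings below `min_rpm` (hypothetical threshold)."""
    return {path: rpm for path, rpm in readings.items() if rpm < min_rpm}


if __name__ == "__main__":
    for path, rpm in slow_fans(read_fan_rpms()).items():
        print(f"WARNING: {path} reports {rpm} RPM")
```

     A fan header with nothing plugged into it also reads 0 RPM, so the output needs to be compared against which headers are actually populated.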

     While I was there and letting a torture test run for a while (8 copies of an AVX version of prime95 running small FFTs, about the most torturous thing you can throw at a modern CPU), I worked on some other issues and discovered why postfix was not starting on debian: the postfix and postdrop groups and the postfix user were missing.  I’ve fixed this problem before, but it turns out the post.install script of the postfix package distributed with stretch (and Ubuntu 16.04.3 LTS) is broken.  So it will likely break again the next time a postfix update is kicked out.
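
     Since the missing accounts are likely to come back after a package update, a small check run after each upgrade could catch it early.  The following is a minimal sketch using Python’s standard pwd and grp modules; the function name and the injectable lookup parameters are illustrative assumptions, and it only reports what is missing rather than recreating anything.

```python
import grp
import pwd


def missing_accounts(users=("postfix",), groups=("postfix", "postdrop"),
                     user_lookup=pwd.getpwnam, group_lookup=grp.getgrnam):
    """Return a list of ("user" | "group", name) pairs that do not exist.

    The lookup functions are parameters only so the check can be
    exercised without touching the real account databases.
    """
    missing = []
    for name in users:
        try:
            user_lookup(name)
        except KeyError:  # pwd.getpwnam raises KeyError for unknown users
            missing.append(("user", name))
    for name in groups:
        try:
            group_lookup(name)
        except KeyError:  # grp.getgrnam raises KeyError for unknown groups
            missing.append(("group", name))
    return missing


if __name__ == "__main__":
    for kind, name in missing_accounts():
        print(f"missing {kind}: {name}")
```

     No output means the postfix user and both groups are present.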

Mail Server Maintenance

     I need to take the server that holds the mail spool down briefly a few times tonight to make some BIOS adjustments.  I believe I’ve got the CPU voltage just a touch too low and it is slightly unstable.  It may take several reboots and benchmarks to find the correct level.  The machine has rebooted spontaneously twice in the last couple of days, which is symptomatic of CPU voltage too low for the clock rate.  The last time I worked on it I adjusted the voltage down as low as I could while keeping what I thought was a stable system, so I’m going to crank it up slightly.

Mint 18.2 Sonya Operational

     Our shell server mint.eskimo.com has been upgraded to Mint 18.2 Sonya.  Initially I had some problems getting postfix working; I chased it down to sendmail already being installed and not automatically deinstalled when postfix was installed, as it normally is on other debian derivatives.

     I also had issues with idmapd not working initially.  This turned out to be an old issue with nsswitch.conf: compat does not work as advertised, and it is necessary instead to specify nis files, which gives the behavior compat is supposed to provide but does not.
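
     To find which databases in nsswitch.conf still rely on compat, the file can be parsed with a few lines of code.  This is only a sketch: the function names are invented for the example, and it handles the simple "database: source source" form while ignoring bracketed action items such as [NOTFOUND=return] except to treat them as ordinary tokens.

```python
def parse_nsswitch(text):
    """Parse nsswitch.conf content into {database: [sources...]}."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        database, sources = line.split(":", 1)
        entries[database.strip()] = sources.split()
    return entries


def databases_using_compat(entries):
    """Return the sorted names of databases whose source list includes compat."""
    return sorted(db for db, sources in entries.items() if "compat" in sources)


if __name__ == "__main__":
    try:
        with open("/etc/nsswitch.conf") as fh:
            for db in databases_using_compat(parse_nsswitch(fh.read())):
                print(f"{db} still uses compat")
    except FileNotFoundError:
        print("/etc/nsswitch.conf not found")
```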

Kernel Upgrades Completed

     The kernel upgrades are completed.  All host machines rebooted properly.  The longest took 3 minutes.  However, the web server did not properly NFS mount the /home directory after rebooting, so some websites were unavailable for about half an hour until I discovered and corrected this.
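
     A missed mount like this can be detected right after boot by looking at /proc/mounts.  The sketch below is one way to do that, assuming the usual whitespace-separated layout of that file (device, mount point, filesystem type, options); the function name is hypothetical, and it only reports the state rather than attempting a remount.

```python
def nfs_mounted(mount_point, mounts_text):
    """Return True if `mount_point` appears in `mounts_text` with an NFS type.

    `mounts_text` is the content of /proc/mounts.  Filesystem types
    starting with "nfs" (for example nfs and nfs4) are accepted.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if (len(fields) >= 3 and fields[1] == mount_point
                and fields[2].startswith("nfs")):
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/proc/mounts") as fh:
            ok = nfs_mounted("/home", fh.read())
        print("/home is NFS mounted" if ok else "WARNING: /home is not NFS mounted")
    except FileNotFoundError:
        print("/proc/mounts not available on this system")
```

     Run from cron or a boot-time unit, a check of this sort would turn a half-hour outage into an immediate alert.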