Around midnight I became aware of a problem that had probably been going on since earlier in the day: mail was backing up in the queue on the servers that accept incoming mail from the Internet instead of being delivered properly.
There were multiple causes. Yahoo couldn’t accept list mail as fast as our server wanted to send it, which caused some congestion in the mail queue.
I didn’t edit the network-config script when I changed name servers, so the machines reverted to the old name servers, which are not as well tuned for handling the mail volume.
The lock file for the Bayesian spam filtering didn’t get removed during an update for some reason. This also caused things to back up.
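For anyone curious how a leftover lock like that could be caught automatically, here is a minimal Python sketch. It is not what we run: the lock path and the ten-minute threshold are made-up assumptions, and it simply treats a lock file as stale when its modification time is too old.

```python
import os
import time

def is_stale(lock_path, max_age_seconds, now=None):
    """Return True if the lock file exists and is older than max_age_seconds."""
    try:
        mtime = os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False  # no lock file, so nothing is stale
    if now is None:
        now = time.time()
    return (now - mtime) > max_age_seconds

def clear_if_stale(lock_path, max_age_seconds):
    """Remove the lock file if it is stale; return True if it was removed."""
    if is_stale(lock_path, max_age_seconds):
        os.remove(lock_path)
        return True
    return False

# Hypothetical path and threshold, for illustration only.
print(is_stale("/var/spool/example/bayes.lock", 600))
```

An age check alone can remove a lock that a slow but live process still holds, so the threshold would need to be comfortably longer than any legitimate update.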
The number of processes Postfix was allowed to fork was excessive and was driving the machines into swap; that has also been adjusted.
The queue isn’t completely processed yet, but the servers are chunking away at a good rate and should be fully caught up in a couple of hours.
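The arithmetic behind a process limit is simple enough to sketch in Python. The memory figures below are invented for illustration, not measurements from our machines, and `default_process_limit` is the Postfix main.cf parameter I am assuming is the relevant knob.

```python
def max_processes(ram_mb, reserved_mb, per_process_mb):
    """Largest process count that fits in RAM without spilling into swap."""
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    usable_mb = ram_mb - reserved_mb
    return max(usable_mb // per_process_mb, 0)

# Hypothetical numbers: 4096 MB of RAM, 1024 MB kept back for the OS
# and caches, and roughly 60 MB per delivery once spam filtering is loaded.
limit = max_processes(4096, 1024, 60)
print(f"default_process_limit = {limit}")
```

With those assumed figures, anything much above the printed value would start pushing the machine into swap under full load.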
The new server software seems to have stopped the crashes, but I still wasn’t satisfied with the performance of ns1 and ns2, especially ns1. I discovered that the main load came from the mail servers querying various real-time blackhole lists as part of SpamAssassin’s spam scoring.
To alleviate this I set up two new name servers specifically for mx1 and mx2 to use, taking the load off ns1 and ns2. These new servers are not directly available to the public; they exist to reduce the load on the public servers and improve their response time. I now consistently get responses in under 100 ms on all four public name servers.
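If you want to time a name server yourself, something like the following Python sketch would do it. This is not the tool I used; it hand-builds a minimal DNS query for an A record, sends it over UDP to port 53, and times the reply. The function names are made up, the server address in the usage note is a placeholder from the documentation range, and the sketch does not validate the reply.

```python
import socket
import struct
import time

def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for an A record."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def response_time_ms(server_ip, name, timeout=1.0):
    """Return the round-trip time in milliseconds, or None if no reply arrives."""
    packet = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.perf_counter()
        try:
            sock.sendto(packet, (server_ip, 53))
            sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable hosts
            return None
        return (time.perf_counter() - start) * 1000.0
```

Calling `response_time_ms("192.0.2.1", "eskimo.com")` with a real server address in place of the placeholder returns the elapsed time, or `None` if nothing comes back within the timeout.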
The latest stable release of BIND (named), 9.10.1-P2, is now installed on all of our name servers. Hopefully this will close off the exploit that someone discovered in the last few days.
I am updating BIND because the version we have has been crashing due to recently discovered exploits. Each name server will be out of service for a period of time, and during that time responses from various services may be delayed by 5 seconds if the resolver for that service happens to pick the out-of-service name server first.
I’m back in the office. The Eskimo.com shell server has been restored to service. The router filter adjustments are complete.
I will be going to the co-location facility to attempt to resurrect the old Eskimo.Com SunOS shell server.
In addition, I will be working on some firewall rules to keep idiots’ brute-force password guessing from locking me out of the router.
If you call during this time, please leave a message and include the latest time that it will be okay for me to call you back. If I return before that time, I will call.
The eskimo.com shell server (old SunOS 4.1.4 machine) is down and will remain that way until tomorrow afternoon. It is stuck in a mode where the kernel is still running and it can be pinged, but no user mode programs get CPU time. This is an old SunOS bug.
We only have one car and my wife has it right now. In the meantime please use one of the other shell servers: shellx.eskimo.com, debian.eskimo.com, ubuntu.eskimo.com, opensuse.eskimo.com, centos7.eskimo.com, scientific.eskimo.com, scientific7.eskimo.com, fedora.eskimo.com, or mint.eskimo.com.
Even with all the changes I made, there was STILL an occasional slow load of images, and the delay was always 5 seconds when this happened.
I was finally able to chase this down. Rate limiting on the name servers was kicking in, and 5 seconds is the default time the resolver waits before it gives up and moves on to the next name server.
I have raised the rate limits so that they will rarely be triggered, added the rotate option to the resolv.conf configuration so that queries are spread across the servers, and added timeout:1 so that the resolver gives up after one second instead of five when a server doesn’t respond. That minimizes the page load delay in the event of a server failure.
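To make the resolver change concrete, here is a small Python sketch that renders a resolv.conf with those two options and works out the delay before the resolver falls through to a working server. The addresses are placeholders from the documentation range, not our name servers, and the function names are made up for illustration.

```python
def render_resolv_conf(nameservers, timeout=1, rotate=True):
    """Return resolv.conf text with the given servers and resolver options."""
    if not nameservers:
        raise ValueError("at least one nameserver is required")
    lines = [f"nameserver {ip}" for ip in nameservers]
    options = [f"timeout:{timeout}"]
    if rotate:
        options.append("rotate")  # round-robin across the listed servers
    lines.append("options " + " ".join(options))
    return "\n".join(lines) + "\n"

def delay_before_fallback(dead_servers_tried_first, timeout):
    """Seconds lost waiting on unresponsive servers before a live one is asked."""
    return dead_servers_tried_first * timeout

# Placeholder addresses (192.0.2.0/24 is reserved for documentation).
print(render_resolv_conf(["192.0.2.1", "192.0.2.2"]), end="")
print(delay_before_fallback(1, 5), "seconds with the default timeout")
print(delay_before_fallback(1, 1), "second with timeout:1")
```

With one unresponsive server tried first, the wait drops from five seconds to one, which matches the delay I was seeing on image loads.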
This seems to have completely eliminated the delayed loading.
I have made some configuration changes so that when you go to eskimo.com or www.eskimo.com, you will connect over SSL even if you do not specify https in the URL. This prevents user authentication information in applications like WebMail, the Forums, or phpMyAdmin from being sniffed en route, and it also secures your data.
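The actual change lives in the web server configuration, which I am not reproducing here. As a rough illustration of the redirect logic only, this Python sketch maps a plain http URL to the https address a visitor would be sent to. The function name is made up, and it assumes the secure site answers on the default port.

```python
from urllib.parse import urlsplit, urlunsplit

def https_redirect_target(url):
    """Return the https URL to redirect to, or None if no redirect applies."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.hostname:
        return None  # already secure, or not a usable web URL
    # Drop any explicit port; assumes https is served on its default port.
    return urlunsplit(
        ("https", parts.hostname, parts.path or "/", parts.query, parts.fragment)
    )

print(https_redirect_target("http://www.eskimo.com/webmail/?a=1"))
print(https_redirect_target("http://eskimo.com"))
```

A URL that already starts with https returns `None`, meaning the visitor is left where they are.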
I will be taking the web server down approximately five minutes after midnight tonight in order to image it so that the changes I made to improve responsiveness are not lost in the event a restoration from backup becomes necessary.