Mail Capacity

     I had to re-install Dovecot over the weekend because something went wrong internally that I was unable to identify.

     I installed with mostly default settings, changing only a few things I knew were necessary for our installation.

     This morning it reached some of the internal limits that were set, and I’ve addressed those issues today.  One was the total number of clients that can be active at once.  I would have thought the default of 1000 was sufficient, but many people have half a dozen devices polling, so it was not.  I bumped that to 5000.  I also fixed some configuration issues with locking and indexes, and added some client workarounds for bugs in Thunderbird and Outlook: Outlook chokes on NULs, and Thunderbird adds an extra ‘/’ to mail paths.  There were a few other minor tweaks, but it seems to be working okay now.

Ubuntu Spontaneously Booted 10:08AM Today

     The only thing I could find in the logs is that the guest tried to use a VT-x feature not supported by the host CPU.  But the host CPU DOES support VT-x, and the guest is configured to copy the host CPU configuration, so it should not use anything the host does not support.

     This is the first time I’ve seen this particular error and this configuration has been operating for ten years, so I’ve got to assume a random CPU or memory error.

New Features

There are some New Features I’d like to make you aware of:

On our website, https://www.eskimo.com/, at the far right of the menu bar there is a new button, “Issues”.  Press this button to see the latest news, whether it be outages, new features, or something else.

The new NIS server is Linux based, eliminating the compatibility issues that prevented passwords from being changed directly by users.  You can now change your password on most shell servers by typing ‘passwd’; it will prompt for your old password and then the new password twice.  This works on all shell servers EXCEPT Ubuntu, Centos7, and Scientific7.

And on the subject of passwords, under the old system only the first eight characters of a password were significant.  If you typed a longer password, the characters after the first eight were ignored.  Now up to 16 characters are significant.  If your password is longer than eight characters but does not work, try typing only the first eight characters; if that works, use passwd to change your password to the desired longer form.

If you create a new account today, logins are no longer limited to eight characters and can be as long as sixteen.  The application form has been updated to reflect this.
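To illustrate why typing only the first eight characters can work, here is a minimal Python sketch of the truncation behaviour described above.  It is only a model: it compares plain strings rather than the real encrypted passwords, and the names `significant_part` and `same_under_old_system` are made up for this example.

```python
OLD_SIGNIFICANT = 8   # characters the old system looked at
NEW_SIGNIFICANT = 16  # characters the new system looks at


def significant_part(password: str, limit: int) -> str:
    """Return the leading part of the password that takes part in the check."""
    return password[:limit]


def same_under_old_system(a: str, b: str) -> bool:
    """Two passwords were equivalent if their first eight characters matched."""
    return significant_part(a, OLD_SIGNIFICANT) == significant_part(b, OLD_SIGNIFICANT)


if __name__ == "__main__":
    # Only "correcth" mattered to the old system, so both count as the same.
    print(same_under_old_system("correcthorse", "correcthXYZ"))
    # With the 16 character limit, more of the password is significant.
    print(significant_part("correcthorsebattery", NEW_SIGNIFICANT))
```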

Summary of What Happened

     Over the past 24 hours we’ve had three server issues.  First, some changes to the network were necessary to accommodate a new router that will be replacing our existing router.  The new router does not do port based network address translation which we previously used to provide services from two different servers on a single IP address, specifically, eskimo.com provides shell access to an antique SunOS 4.1.4 based machine that also served as the NIS master and had a database I used for accounting.  That was shared with a different server that provides web services, so that if someone typed https://eskimo.com/ instead of https://www.eskimo.com/, they would still make it to our website.

     So to accommodate the new router (which has not yet been placed into service), I changed the name of the shell server to sunos.eskimo.com so eskimo.com could be pointed to the web server directly without the need of Network Address Translation.

     There were unintended consequences of this in that for some reason Sun’s calendar manager database, a database I’ve hijacked to use for accounting purposes, would not work with the name change.

     Well, I gave it about five minutes of thought: this machine is 30 years old, and if it dies the parts are all made out of unobtainium, so it is best to move these functions to modern servers.  The problem is that the database was proprietary, so I can’t easily move it; I dumped it as a plain text file and am hand-inputting all the info into a new database that is Linux based.  I’ve wanted to switch NIS servers for a long time anyway, because the SunOS server is only capable of eight character usernames, 16 bit user IDs, eight significant characters in the password, which is not very secure, and triple DES encryption, which is also not secure.  The Linux NIS server, by contrast, can handle seven different types of encryption, all of which are superior to triple DES, 16 character usernames, and 32 bit user IDs, enough for 3/4ths of the planet’s population if they all joined.  It will also enable passwords to be changed by users directly again, or it will once we get all the bugs squashed.  But there was an issue with the Linux implementation of NIS that I was not aware of until after the machine crashed this morning.  With SunOS, local passwords are in local files, and then the same files with the extension “.yp” get exported to the network.  Linux does not use separate files: system logins have UIDs below 1000, user IDs are 1000 and greater, and all the data is in one file.  There were also some format differences between SunOS’s passwd.adjunct file, used for encrypted passwords, and Linux’s “shadow”, but I figured those out and was able to fix them with a few global edits in vi.

     I didn’t think this would be an issue, because when I have UIDs in both local and network systems I keep them the SAME, so it does not matter which mechanism a machine obtained them from; it would get the same values either way.  However, after a reboot of the new master NIS server this morning, everything went to hell in a handbasket.

     I had already segregated UIDs, local versus system, and all users have system IDs of 2000 or more, but I had not similarly segregated groups, and therein lay the problem.  So that meant deleting the old 50, 51, 52, 53, 54, and 57 GIDs in use for various types of accounts and replacing them with groups above 1000.  Doing that for 500 users at once is non-trivial, so that is why there was a period where the group ID on your files did not resolve to a name.
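
     The group problem can be sketched as a simple check: list every group whose GID falls below the cutoff and so would stay local.  This is only an illustration, assuming a cutoff of 1000 and standard name:password:gid:members group lines.  The function name and the sample lines are hypothetical, apart from ‘rmtonly’ with GID 57.

```python
MIN_GID = 1000  # assumed cutoff; groups below this are treated as local only


def groups_below_cutoff(group_lines, min_gid=MIN_GID):
    """Return (name, gid) pairs for group-file lines whose gid is below min_gid."""
    found = []
    for line in group_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        fields = line.split(":")
        if len(fields) < 3 or not fields[2].isdigit():
            continue  # skip malformed or non-numeric entries
        name, gid = fields[0], int(fields[2])
        if gid < min_gid:
            found.append((name, gid))
    return found


# Hypothetical sample data in group-file format.
SAMPLE = [
    "rmtonly:*:57:",
    "shell:*:1057:alice,bob",
    "# a comment line",
    "",
]

if __name__ == "__main__":
    for name, gid in groups_below_cutoff(SAMPLE):
        print(f"{name} (gid {gid}) is below {MIN_GID} and would not reach the network")
```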

     And by the way the drive to the co-lo and back was two hours each way owing to absolutely shit traffic on I-90, I-405, and I-5 so much of the drive went through back roads of Southern Bellevue, and for some reason their transportation engineers are utterly incapable of drawing a straight line and so my trajectory was anything but direct or fast, so my average round trip speed over 22 miles each way was about 10MPH.

     Then I get back and find the mail server is on its ass.  I look at the logs, and the authentication logs and mail logs are both complaining about “extra groups”, and they were just those groups I had entirely removed from the network.  So apparently Dovecot, which provides both imap and pop3 and also serves as the authentication server for mail, caches some of this data somewhere, and rather than just update its fricken cache when the data changes, it refuses to authenticate.  I could not figure out where it was squirreling this data, so I completely purged and re-installed Dovecot; the only challenge there was that they had changed the configuration significantly since I last installed it.

     So that was the cause of the outage today.  The good side is that we are no longer relying on 30 year old hardware, once I get the bugs worked out you will be able to change your passwords yourself, and longer usernames are now possible.  And we’ve eliminated anything hardware-wise that will be an issue in 2038.  There are still some file systems in use with 64-bit inodes that only go to 2038, but all the new ones I am creating have larger inodes that can accommodate all of known history, so by the time 2038 gets here it should be a non-event just like 2000, if I’m still alive (I will be 80 if so).
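
     For anyone wondering where 2038 comes from: the usual explanation is a signed 32-bit count of seconds since January 1, 1970, which runs out early that year.  The following is a small Python sketch of that arithmetic, assuming that is the limit in play here; `last_representable` is a name invented for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def last_representable(bits: int) -> datetime:
    """Last moment a signed counter of `bits` bits (in seconds) can represent."""
    max_seconds = 2 ** (bits - 1) - 1
    return EPOCH + timedelta(seconds=max_seconds)


if __name__ == "__main__":
    # A signed 32-bit counter tops out at 2**31 - 1 seconds after the epoch.
    print(last_representable(32).isoformat())
```

     Note that a 64-bit counter is far beyond what Python’s datetime can represent, so the sketch is only meant to be called with small widths such as 32.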

     For all of those who called, e-mailed, generated tickets, to let me know things were down, I appreciate that, but once aware I prefer to dedicate my energies to getting things back up, so please, in the future, check our website https://www.eskimo.com/news before calling, e-mailing, or generating a ticket.  Thank you.

Mail Server is back UP

     The mail server is back up.  I don’t know what happened to Dovecot, but it was generating errors I’ve never seen before, errors referring to “extra groups”.  I was not able to gain any useful insights from Google or Bing, a reboot didn’t help, and debug output didn’t offer anything useful either.

     Finally I purged Dovecot (the imap, pop3, and mail authentication server) and re-installed from scratch.  The config files had changed significantly since I first installed it, so it is possible an update changed the software to expect the new format but didn’t bother to tell me.  At any rate it is back up.  There may be some rough edges yet, but it is at least functioning.

Mail

The main NIS server is back up, but the mail server is still not operating correctly.  It is complaining about “extra groups”, a message I’ve never seen before, and I am not having any luck googling it.  I suspect some old auth info is being cached somewhere, but damned if I know where, so I’ve purged the whole thing and am re-installing from scratch.

 

Authentication Problems

I re-configured the NIS system to have a Linux based NIS master last night.  It was working last night, but the main NIS server crashed during the night, and although it should still work from the slave servers, it is having problems.  So I’m going down to the co-lo facility to restore the crashed server, and if the problem persists I will troubleshoot further.  This is affecting mail and some shell servers.

Router Replacement

     The router replacement that was planned for Saturday is probably going to be delayed until Monday.

     In the meantime, the old SunOS machine is being retired and I am setting up a new NIS master on a Linux machine.  Linux and SunOS implement NIS differently: SunOS puts the items that are to be distributed network wide in a different set of files, whereas Linux propagates everything above a certain UID and GID to the network.  Because of this I am going to have to change some GIDs; some in the 50’s will become higher numbers so that they will be in the NIS system.  This will require finds and chgrp’s on big file systems, so there may be some period where your GID is ‘57’ instead of ‘rmtonly’, which will become ‘shell’ since access is really not part of the plan nor relevant to the platform GID any longer.
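
     The finds and chgrp’s amount to walking a file system and remapping group IDs.  Here is a hedged Python sketch of that idea, not the procedure actually used: it first builds a plan (a dry run) and only changes ownership in a separate step.  The function names and the example mapping of 57 to 1057 are hypothetical.

```python
import os


def plan_gid_changes(root, gid_map):
    """Walk root and return (path, old_gid, new_gid) for entries whose gid is in gid_map.

    The root directory itself is not included, only what lies beneath it.
    """
    changes = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            old_gid = os.lstat(path).st_gid  # lstat: do not follow symlinks
            if old_gid in gid_map:
                changes.append((path, old_gid, gid_map[old_gid]))
    return changes


def apply_gid_changes(changes):
    """Apply a plan; needs enough privilege to change group ownership."""
    for path, _old_gid, new_gid in changes:
        os.lchown(path, -1, new_gid)  # -1 leaves the owner unchanged


if __name__ == "__main__":
    # Hypothetical mapping: old GID 57 becomes 1057.
    for path, old, new in plan_gid_changes(".", {57: 1057}):
        print(f"{path}: {old} -> {new}")
```

     Building the plan first makes it possible to review what would change on a big file system before touching anything.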

     I will potentially resurrect SunOS some time in the future when I can get an emulator working but we will need to find some alternative method of authentication as I am going to start allowing 16 character usernames, longer passwords, and new stronger encryption, as well as yppasswd on the client machines so you can change your password yourself.

Network Work this Saturday

     This will impact ALL of eskimo.com’s services.

     If all goes well I am going to replace our router this Saturday evening.  It may take a bit to get it to work, as the interface is quite different from the old one and I’ve got some concerns about configuring the network side of things.  After this happens, and possibly before, the shell server “eskimo.com” will become “sunos.eskimo.com”.  This is because of the new router’s lack of support for port forwarding and the need to separate shell services from web services on this IP.

     The main reason for this change is that traffic has grown to where our existing router is challenged and the occasional denial of service attack, which is just a reality of being connected to the Internet, is enough to overload the CPU and cause traffic to be dropped or significantly delayed.

     The existing router has two 500 MHz PowerPC cores; the new unit has four 1.6 GHz PowerPC cores, or roughly six times as much CPU.  It also has a 1TB hard drive, so we can put some more useful software on it than on the existing machine.
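
     The “roughly six times” figure can be reproduced with simple arithmetic on cores times clock speed.  This Python sketch assumes that product is a fair proxy for CPU capacity, which is crude since it ignores differences between core generations; `aggregate_ghz` is a name made up for the example.

```python
def aggregate_ghz(cores: int, ghz_per_core: float) -> float:
    """Crude capacity estimate: number of cores times clock speed in GHz."""
    return cores * ghz_per_core


OLD_ROUTER = aggregate_ghz(2, 0.5)  # two 500 MHz cores
NEW_ROUTER = aggregate_ghz(4, 1.6)  # four 1.6 GHz cores
RATIO = NEW_ROUTER / OLD_ROUTER

if __name__ == "__main__":
    print(f"old: {OLD_ROUTER:.1f} GHz total, new: {NEW_ROUTER:.1f} GHz total")
    print(f"ratio: about {RATIO:.1f}x")
```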

     I will do my best to minimize the outage time but as I’ve stated, the new interface may take some experimentation.

Work Tonight and Ongoing

This will happen sometime this evening, probably after 10PM, but I can’t be exact because of other commitments with unknown time frames.

I’m going to be taking the new server, Inuvik, down for about an hour or so to install some adapters for the drives I replaced last week.

The new drives have repurposed pin 3 of the power connector, which used to provide +3.3 volts (modern drives need only +5 and +12), to now tell the drive to power down.

It would have made more sense to do it the other way, so that drives on old cables that only use that pin to provide +3.3 volts would by default be ON, but ya know, standards committees and all that.

So all these adapters do is sit between power and the drive to open pin 3 so the drives turn on and spin up like normal drives.

This will affect Debian, Manjaro, and some web services.