Summary of What Happened

     Over the past 24 hours we’ve had three server issues.  First, some changes to the network were necessary to accommodate a new router that will be replacing our existing one.  The new router does not do port-based network address translation, which we previously used to provide services from two different servers on a single IP address.  Specifically, eskimo.com provided shell access to an antique SunOS 4.1.4 based machine that also served as the NIS master and held a database I used for accounting.  That address was shared with a different server that provides web services, so that if someone typed https://eskimo.com/ instead of https://www.eskimo.com/, they would still make it to our website.

     So to accommodate the new router (which has not yet been placed into service), I changed the name of the shell server to sunos.eskimo.com so that eskimo.com could be pointed at the web server directly, without the need for network address translation.

     This had unintended consequences: for some reason Sun’s Calendar Manager database, a database I’ve hijacked to use for accounting purposes, would not work after the name change.

     Well, I gave it about five minutes of thought and figured that since this machine is 30 years old, and its parts are all made out of unobtainium if it dies, it was best to move these functions to modern servers.  The problem: the database was proprietary, so I can’t easily move it.  I dumped it as a plain text file and am hand-inputting all the info into a new database that is Linux based.

     I’ve wanted to switch NIS servers for a long time anyway, because the SunOS server is only capable of eight-character usernames, 16-bit user IDs, passwords with only 8 significant characters, which is not very secure, and triple DES encryption, also not secure.  The Linux NIS server, by contrast, can handle seven different types of encryption, all of which are superior to triple DES, 16-character usernames, and 32-bit user IDs, about 4.3 billion of them, so enough for roughly half of the planet’s population if they all joined.  It would also enable passwords to be changed by users directly again; well, it will once we get all the bugs squashed.

     But there was an issue with the Linux implementation of NIS that I was not aware of until after the machine crashed this morning.  With SunOS, local passwords are in local files, and then the same files with the extension “.yp” get exported to the network.  (There were also some format differences between SunOS’s passwd.adjunct file, used for encrypted passwords, and Linux’s “shadow”, but I figured those out and was able to fix them with a few global edits in vi.)  Linux does not use separate files; all the data is in one file, and the way it tells the two apart is that system logins have UIDs below 1000 and user IDs are 1000 and greater.
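
     To make the single-file arrangement concrete, here is a minimal sketch of splitting one passwd-format file into local and network entries by UID.  It assumes standard colon-separated passwd lines and the 1000 threshold described above; the function name and the sample accounts are made up for illustration, and this is not the actual script the NIS server runs.

```python
# Threshold from the text above: UIDs below 1000 are treated as system logins.
MIN_NETWORK_UID = 1000


def split_passwd(lines, min_uid=MIN_NETWORK_UID):
    """Split passwd-format lines into (local, network) lists by UID."""
    local, network = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        uid = int(fields[2])  # third field of a passwd entry is the UID
        (network if uid >= min_uid else local).append(line)
    return local, network


# Hypothetical sample entries, not real accounts.
sample = [
    "root:x:0:0:root:/root:/bin/bash",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin",
    "alice:x:2001:2001:Alice Example:/home/alice:/bin/bash",
]

local, network = split_passwd(sample)
local_names = [entry.split(":")[0] for entry in local]
network_names = [entry.split(":")[0] for entry in network]
print("local:  ", local_names)
print("network:", network_names)
```

     The same idea applies to groups, with the GID in the third field of each group entry.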

     I didn’t think this would be an issue, because when I have UIDs in both the local and network systems I keep them the SAME, so it does not matter which mechanism a machine obtained them from; it gets the same values either way.  However, after a reboot of the new master NIS server this morning, everything went to hell in a handbasket.

     I had already segregated UIDs, system versus user, and all users have UIDs of 2000 or more, but I had not similarly segregated groups, and therein lay the problem.  So that meant deleting the old GIDs 50, 51, 52, 53, 54, and 57, in use for various types of accounts, and replacing them with groups above 1000.  Doing that for 500 users at once is non-trivial, which is why there was a period where the group ID on your files did not resolve to a name.
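
     As a rough illustration of the renumbering, the sketch below rewrites the GID field in passwd and group entries using a lookup table.  The old GIDs are the ones listed above; the new GIDs are hypothetical placeholders picked only because they sit above 1000, and the function names and sample lines are likewise invented for the example.

```python
# Old GIDs come from the text; the new values are hypothetical placeholders.
GID_MAP = {50: 1050, 51: 1051, 52: 1052, 53: 1053, 54: 1054, 57: 1057}


def remap_passwd_line(line, gid_map=GID_MAP):
    """Return a passwd entry with its primary GID (4th field) remapped."""
    fields = line.rstrip("\n").split(":")
    gid = int(fields[3])
    fields[3] = str(gid_map.get(gid, gid))  # unknown GIDs pass through unchanged
    return ":".join(fields)


def remap_group_line(line, gid_map=GID_MAP):
    """Return a group entry with its GID (3rd field) remapped."""
    fields = line.rstrip("\n").split(":")
    gid = int(fields[2])
    fields[2] = str(gid_map.get(gid, gid))
    return ":".join(fields)


# Hypothetical sample entries.
new_passwd = remap_passwd_line("alice:x:2001:50:Alice Example:/home/alice:/bin/bash")
new_group = remap_group_line("shellusers:*:50:alice,bob")
print(new_passwd)
print(new_group)
```

     Rewriting the account files is only half the job.  Files on disk store the numeric GID, so until they are re-owned to the new number they still carry the old one, which no longer maps to a group name.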

     And by the way, the drive to the co-lo and back was two hours each way owing to absolutely shit traffic on I-90, I-405, and I-5, so much of the drive went through the back roads of southern Bellevue.  For some reason their transportation engineers are utterly incapable of drawing a straight line, so my trajectory was anything but direct or fast, and my average speed over the round trip, 22 miles each way, was about 10 MPH.

     Then I get back and find the mail server is on its ass.  I look at the logs, and the authentication logs and mail logs are both complaining about “extra groups”, and they were just those groups I had entirely removed from the network.  So apparently Dovecot, which provides both IMAP and POP3 and also serves as the authentication server for mail, caches some of this data somewhere, and rather than just update its fricken cache when the data changes, it refuses to authenticate.  I could not figure out where it was squirreling this data away, so I completely purged and re-installed Dovecot; the only challenge there was that the configuration had changed significantly since I last installed it.

     So that was the cause of the outage today.  The good side is that we are no longer relying on 30-year-old hardware, once I get the bugs worked out you will be able to change your passwords yourself, and longer usernames are now possible.  And we’ve eliminated anything hardware-wise that will be an issue in 2038.  There are still some file systems in use with inode timestamps that only go to 2038, but all the new ones I am creating have larger inodes that can accommodate all of known history, so by the time 2038 gets here it should be a non-event just like 2000, if I’m still alive (I will be 80 if so).
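
     For anyone wondering where 2038 comes from: a timestamp stored as a signed 32-bit count of seconds since January 1, 1970 UTC runs out that year.  The short sketch below works out the exact moment and what a wrapped counter would read one second later.  The helper name is made up for the example, and the code only illustrates the arithmetic, not how any particular file system stores its times.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
MAX_INT32 = 2**31 - 1  # largest value a signed 32-bit counter can hold


def wrap_int32(n):
    """Wrap an integer into the signed 32-bit range, as an overflow would."""
    return (n + 2**31) % 2**32 - 2**31


# Last second representable by a signed 32-bit seconds-since-epoch counter.
last_moment = EPOCH + timedelta(seconds=MAX_INT32)

# One second later the counter wraps around to a large negative number.
wrapped = wrap_int32(MAX_INT32 + 1)
after_wrap = EPOCH + timedelta(seconds=wrapped)

print(last_moment.isoformat())  # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())   # 1901-12-13T20:45:52+00:00
```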

     For all of those who called, e-mailed, or generated tickets to let me know things were down, I appreciate that, but once I’m aware of a problem I prefer to dedicate my energies to getting things back up.  So please, in the future, check our website at https://www.eskimo.com/news before calling, e-mailing, or generating a ticket.  Thank you.