Eskimo North



          Mail Problems


          • To: outages-list@eskimo.com
          • Subject: Mail Problems
          • From: Robert Dinse <nanook@eskimo.com>
          • Date: Fri, 21 Aug 1998 19:52:01 -0700 (PDT)
          • Resent-Date: Fri, 21 Aug 1998 19:51:22 -0700
          • Resent-From: outages-list@eskimo.com
          • Resent-Message-ID: <"OAYzY.0.3G4.d8Ztr"@mx1>
          • Resent-Sender: outages-list-request@eskimo.com

          
               WHAT HAPPENED TO MAIL:
          
               This morning, August 21, a mail loop exhausted the disk space on the
          partition holding the mail spool.
          
               The POP server did something strange and turned every spool file that
          was accessed via POP during this interval into a 540MB file consisting
          mostly of nulls.
          
               Approximately 260 users' mail spool files were affected in this
          manner.
          
               This is possible even though the whole partition is only 2GB, because
          Unix allows files to have "holes": ranges that have no disk blocks
          assigned and read back as nulls.  Exactly why the POP server misbehaved
          in this manner when disk space was exhausted, I can't say.
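          
               As a rough illustration of such holes (a minimal sketch, not the code
          involved in this incident), the Python below creates a file whose reported
          size is far larger than the disk space it occupies.  The function names and
          the example path are hypothetical, and it assumes a Unix file system that
          supports sparse files.

```python
import os
import tempfile


def make_sparse_file(path, size):
    """Create a file of `size` bytes by seeking past the end and writing one byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)  # skip over a region that is never written (the hole)
        f.write(b"\n")    # only this final byte is actually written


def sizes(path):
    """Return (apparent size, allocated bytes); allocated is None if unknown."""
    st = os.stat(path)
    # st_blocks counts 512-byte units on Unix; some platforms do not provide it.
    allocated = st.st_blocks * 512 if hasattr(st, "st_blocks") else None
    return st.st_size, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "spool_example")  # hypothetical file name
        make_sparse_file(path, 540 * 1024 * 1024)
        apparent, allocated = sizes(path)
        print("apparent size:", apparent, "bytes")
        print("allocated    :", allocated, "bytes")
        with open(path, "rb") as f:
            print("start of file:", f.read(8))  # bytes in the hole read as nulls
```

               On a file system with sparse-file support the allocated figure should
          be a tiny fraction of the apparent size, which is how many such files can
          coexist on a partition much smaller than their combined length.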
          
               However, if someone tried to access a spool file in that state, it
          would essentially stop the machine.
          
               SHORT TERM FIX:
          
               So I have moved all the affected spool files to a separate directory
          and written a small program to remove the nulls.  As it finishes with
          each user, it puts the recovered messages back in that user's mail spool.
          At the rate it is processing, this will take approximately 16 hours to
          complete.
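          
               The recovery program itself is not shown here.  One way such a filter
          could look is sketched below, assuming that legitimate mail contains no null
          bytes so every null can be dropped; strip_nuls and its parameters are
          hypothetical names, not the program actually used.

```python
def strip_nuls(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every NUL byte.

    Reads in fixed-size chunks so a very large spool file never has to fit
    in memory.  Returns the number of bytes written to dst_path.
    """
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # end of file
                break
            kept = chunk.replace(b"\x00", b"")  # remove the nulls
            dst.write(kept)
            written += len(kept)
    return written
```

               Reading a hole still costs time even though it occupies no disk
          blocks, because every null byte has to be produced and scanned, so a long
          run time over hundreds of 540MB files is not surprising.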
          
               There were about 14 files that I accidentally blitzed when getting
          this going, and those are unrecoverable, but the majority will be recovered
          sometime this evening.
          
               LONG TERM FIX:
          
               I do have some plans for addressing this in the long term that also
          address some security, speed, and load concerns.  This involves putting
          the mail spool, ftp files, and user directories on a dedicated file
          server.
          
               There are numerous reasons for doing this, but by putting them all on
          one machine, quotas can be enforced across all of these things, and that
          will prevent a mail loop like the one we experienced this morning from
          exhausting all resources.
          
               Other reasons include:  By striping the partitions across multiple
          drives, disk I/O performance can be enhanced.
          
               By using one large striped partition, a lot of unnecessary disk I/O
          involving moving data between various partitions can be eliminated.
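          
               To illustrate why striping helps, here is a generic sketch of how a
          striped volume could spread consecutive blocks over several drives.  It is
          not a description of the planned server's configuration; locate_block and
          its parameters are hypothetical.

```python
def locate_block(logical_block, num_drives, stripe_blocks=1):
    """Map a logical block number to (drive index, block number on that drive).

    Blocks are grouped into stripe units of `stripe_blocks` blocks, and the
    units are dealt out to the drives in round-robin order.
    """
    stripe_number, offset = divmod(logical_block, stripe_blocks)
    drive = stripe_number % num_drives
    block_on_drive = (stripe_number // num_drives) * stripe_blocks + offset
    return drive, block_on_drive


if __name__ == "__main__":
    # Consecutive logical blocks land on different drives, so a sequential
    # read or write can keep all of the drives busy at once.
    for n in range(8):
        print(n, locate_block(n, num_drives=4))
```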
          
               It will considerably simplify the NFS mounts between machines.
          
               The file system won't be exported with root access; this will provide
          better security for user files.
          
               The file server won't be doing anything else, so it should be possible
          to provide a stable server for this function, eliminating a lot of
          system-wide outages.  At present eskimo and mail export files, and both do
          a lot of other things; when either crashes, it essentially locks up most
          other things.
          
               It will remove NFS load from mail, eskimo, and ftp.  The NFS load on mail
          and eskimo is considerable, and taking this load off will enhance performance
          significantly.  It will also provide faster web page serving.
          
               It will make it possible to write backups and recover files without
          loading servers that run user applications, thus reducing the impact those
          activities have on normal system operations.
          
          
          
