Eskimo North



Mail Problems




     I didn't post this earlier because additional mail traffic would have
increased the recovery time.

     The mail problems today are unrelated to the memory error that
occurred on mail.eskimo.com last night.

     A mail loop ate up all the spool space on mail, resulting in the
corruption of about 450 mailboxes.  I have a program that attempts to fix
this, and it usually succeeds, but because this failure leaves each
mailbox as a 540mb file of mostly holes, it is exceedingly slow.  It was
taking about five minutes per box.  I modified it today to improve that
somewhat, and at this point your mail should be recovered, with the
exception of three boxes that were damaged when I typed something stupid.
Those three are being restored from a tape that was written this morning.

     Because it would take far too long to go through an entire 540mb file
of mostly holes for each mailbox, the program makes some educated guesses
based on my own past experience with this problem.  It copies data from
the file up to where it starts getting nulls, skipping any nulls.  When it
gets several consecutive megabytes of nulls, it skips to the end of the
file, backs up about a megabyte, and reads forward from there.
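
     The heuristic described above can be sketched roughly as follows.
This is not the actual recovery program, only a minimal illustration.  The
function name recover_mailbox, the chunk size, and the exact thresholds
(4mb for "several megabytes", 1mb for "about a megabyte") are assumptions
made for the example.

```python
import os

CHUNK = 64 * 1024                  # read size (assumed)
NULL_RUN_LIMIT = 4 * 1024 * 1024   # "several megabytes" of nulls (assumed)
TAIL = 1024 * 1024                 # "about a megabyte" before end of file


def recover_mailbox(src_path, dst_path, chunk=CHUNK,
                    null_run_limit=NULL_RUN_LIMIT, tail=TAIL):
    """Copy non-null data from src_path to dst_path (hypothetical helper).

    Nulls are always dropped.  After a long run of all-null chunks the
    reader jumps once to `tail` bytes before the end of the file and
    continues from there.  Returns True if that jump happened.
    """
    size = os.path.getsize(src_path)
    null_run = 0      # bytes seen in consecutive all-null chunks
    jumped = False
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            data = block.replace(b"\x00", b"")
            if data:
                dst.write(data)
                null_run = 0
            else:
                null_run += len(block)
            if not jumped and null_run >= null_run_limit:
                # Give up scanning the hole; never seek backwards.
                src.seek(max(src.tell(), size - tail))
                jumped = True
                null_run = 0
    return jumped
```

     The trade-off is the one already mentioned: anything valid sitting in
the middle of the hole, between the jump point and the last megabyte, is
not seen by a scan like this.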

     The reason it works this way is that searching the entire file would
take about half an hour for each mailbox.  When a mailbox gets corrupted,
the spool is full, and for some reason that I have not yet been able to
identify, the system keeps trying to add blocks until the file is around
540mb.  Then, if some tmp file is removed, making additional room, it may
add some additional valid data onto the end.
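
     Since the damaged mailboxes are huge in apparent size but mostly
holes, one way to spot them is to compare each file's length with the
space actually allocated to it.  The sketch below is only an illustration
and is not part of the recovery program.  The names hole_fraction and
suspect_mailboxes and both thresholds are assumptions, and it relies on
st_blocks being reported in 512-byte units, as it is on Linux and most
Unix systems.

```python
import os


def hole_fraction(size_bytes, blocks_512):
    """Fraction of a file's apparent size that has no blocks allocated."""
    if size_bytes <= 0:
        return 0.0
    allocated = blocks_512 * 512
    return max(0.0, 1.0 - allocated / size_bytes)


def suspect_mailboxes(spool_dir, min_size=100 * 1024 * 1024,
                      min_hole_fraction=0.9):
    """List large, mostly-hole files in spool_dir (thresholds assumed)."""
    suspects = []
    for entry in os.scandir(spool_dir):
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        if (st.st_size >= min_size and
                hole_fraction(st.st_size, st.st_blocks) >= min_hole_fraction):
            suspects.append(entry.path)
    return sorted(suspects)
```

     A mailbox of 540mb with only a megabyte or so of allocated blocks
would come out with a hole fraction above 0.99 here.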

     There is the potential for a message to get partially written and
then fail because it runs out of space, leaving half a message on the
spool, but that "should" report a failure status to the sending site
causing it to be requeued and re-sent later.

     If anybody's mailbox is unrecoverably corrupted, we can restore it
from a tape written around 5am this morning, but obviously I didn't want
to do that for everybody if we could recover without going to tape,
because that would lose incoming mail from about 5am until 11:30 or
thereabouts, when the spool filled up.

     I did lose some e-mail to support also.  Support mail is normally
processed by procmail and written to files in user space, from where it is
read.  When support's quota was reached, procmail was unable to process
the mail and left it in the system spool (which filled up and died).  In
order to get the mail server back up as quickly as possible I blitzed this
file (it was mostly bounce messages anyway).  Then, when deleting the
bounce messages in the files processed by pine, I hit too many 'd's and
deleted a few user inquiries as well.  So if you do not get a response to
mail sent to support today between about 5am and 3pm, please re-send.

     The mail problem cascaded to some of the other servers.  Because
people were trying to read corrupted 540mb mailboxes on Eskimo, this put a
severe load on Eskimo, which caused NFS timeouts for the web servers.

     Linux NFS has a bug that sometimes causes it not to recover from an
NFS timeout (so much for a supposedly 'stateless' protocol), and that bug
caused the web servers to fail.  Since we were very preoccupied with
resolving the mail problems, and people were jamming the phone lines to
call about mail, the web server failures went unnoticed for some time.

     Ultimately, the file server we're trying to put together will prevent
this problem by enforcing quota on mail spools, but this is really a major
problem when it occurs.  Mail loops are not entirely avoidable, but this
behavior where the file gets extended out to 540mb of mostly holes should
be.  I do not understand the nature of it, but I didn't see this behavior
prior to the installation of sendmail 8.9.x, so I suspect it's related to
that.