Mail Problems
- To: outages-list@eskimo.com
- Subject: Mail Problems
- From: Robert Dinse <nanook@eskimo.com>
- Date: Fri, 21 Aug 1998 19:52:01 -0700 (PDT)
- Resent-Date: Fri, 21 Aug 1998 19:51:22 -0700
- Resent-From: outages-list@eskimo.com
- Resent-Message-ID: <"OAYzY.0.3G4.d8Ztr"@mx1>
- Resent-Sender: outages-list-request@eskimo.com
WHAT HAPPENED TO MAIL:
This morning, August 21, a mail loop exhausted the disk space in the
partition holding the mail spool.
The POP server did something strange and turned every spool file that
was accessed via POP during this interval into a 540MB file consisting
mostly of nulls.
Approximately 260 users' mail spool files were affected in this
manner.
This is possible even though the whole partition is only 2GB, because
Unix allows files to have "holes" that have no disk blocks assigned and
read back as nulls. Exactly why the POP server misbehaved in this manner
when disk space was exhausted, I can't say.
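To illustrate how a file can appear far larger than the space it occupies, here is a minimal Python sketch. It is only an illustration, not what the POP server did: the function name make_sparse_file is hypothetical, and whether the hole really stays unallocated depends on the file system in use.

```python
import os
import tempfile


def make_sparse_file(path, apparent_size):
    """Create a file of apparent_size bytes by seeking past the end
    and writing a single byte. Returns (apparent size, allocated bytes)."""
    with open(path, "wb") as f:
        f.seek(apparent_size - 1)  # skip ahead without writing anything
        f.write(b"\0")             # one byte at the end fixes the length
    st = os.stat(path)
    # st_blocks counts 512-byte units actually allocated; it is not
    # available on every platform, so fall back to 0 if it is missing.
    allocated = getattr(st, "st_blocks", 0) * 512
    return st.st_size, allocated


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, allocated = make_sparse_file(
            os.path.join(d, "spool"), 540 * 1024 * 1024
        )
        print("apparent size:", size, "allocated bytes:", allocated)
```

On a file system that supports holes, the allocated figure should come out much smaller than the apparent size, and reading the hole returns null bytes.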
However, if someone tried to access a spool file in that state, it
would essentially stop the machine.
SHORT TERM FIX:
So I have moved all the affected spool files to a separate directory
and written a small program to remove the nulls. As it finishes with
each user's file, it will put the recovered messages back in that user's
mail spool. At the rate it is processing, this will take approximately
16 hours to complete.
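For readers curious what such a cleanup might look like, here is a rough Python sketch. It is not the program actually being run; the function name strip_nulls and the chunk size are assumptions made for illustration. It reads the damaged file in pieces so that a 540MB file never has to fit in memory.

```python
def strip_nulls(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path, dropping every null byte.
    Returns the number of bytes written to dst_path."""
    written = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:          # end of file
                break
            kept = chunk.replace(b"\0", b"")
            dst.write(kept)
            written += len(kept)
    return written
```

Nulls are removed one byte at a time, so a run of nulls that straddles two chunks is handled without any special casing.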
There were about 14 files that I accidentally blitzed when getting
this going, and those are unrecoverable, but the majority will be
recovered sometime this evening.
LONG TERM FIX:
I do have some plans for addressing this in the long term that also
address some security, speed, and load concerns. This involves putting
the mail spool, ftp files, and user directories on a dedicated file
server.
There are numerous reasons for doing this. With all of them on one
machine, quotas can be enforced across all of these things, and that
will prevent a mail loop like the one we experienced this morning from
exhausting all resources.
Other reasons include the following. By striping the partitions across
multiple drives, disk I/O performance can be enhanced.
By using one large striped partition, a lot of unnecessary disk I/O
involving moving data between various partitions can be eliminated.
It will considerably simplify the NFS mounts between machines.
The file system won't be exported with root access, which will provide
better security for user files.
The file server won't be doing anything else, so it should be possible
to provide a stable server for this function, eliminating a lot of
system-wide outages. Today eskimo and mail both export files and both do
a lot of other things, so when either crashes it essentially locks up
most other things.
It will remove NFS load from mail, eskimo, and ftp. The NFS load on mail
and eskimo is considerable and taking this load off will enhance performance
significantly. It will also provide faster web page serving.
It will make it possible to write backups and recover files without
loading the servers that run user applications, thus reducing the impact
those activities have on normal system operations.
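As a footnote on why striping a partition across multiple drives helps disk I/O, here is a small Python sketch of the usual block layout. The function name locate_block and the drive and stripe sizes are assumptions for illustration, not the planned configuration. Consecutive stripes land on different drives, so a large sequential read is spread over all of them.

```python
def locate_block(logical_block, num_disks, stripe_blocks):
    """Map a logical block number to (disk index, block offset on that disk)
    for a simple striped layout with stripe_blocks blocks per stripe."""
    stripe_index, within = divmod(logical_block, stripe_blocks)
    disk = stripe_index % num_disks   # stripes rotate across the drives
    row = stripe_index // num_disks   # how many full rounds came before
    return disk, row * stripe_blocks + within


if __name__ == "__main__":
    # Hypothetical layout: 4 drives, 2 blocks per stripe.
    for block in range(10):
        print(block, locate_block(block, num_disks=4, stripe_blocks=2))
```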