Mail & Web
- To: outages-list@eskimo.com
- Subject: Mail & Web
- From: Robert Dinse <nanook@eskimo.com>
- Date: Sat, 29 Aug 1998 22:01:37 -0700 (PDT)
- Resent-Date: Sat, 29 Aug 1998 22:00:39 -0700
- Resent-From: outages-list@eskimo.com
- Resent-Message-ID: <"XYWck.0.gY.rnDwr"@mx1>
- Resent-Sender: outages-list-request@eskimo.com
I apologize for not posting this sooner, but I have been up to my
eyeballs today, to put it mildly.
We had a user who wrote some procmail rules intended to bounce spam
to the postmaster of the originating domain.
About 99% of the time, spammers forge the headers, so the domains or
reply addresses aren't valid. The bounced message was therefore bounced
back to him, where his procmail rules again decided it was spam and
bounced it once more. The end result was a mail loop that consumed all of
the mail spool space during the night.
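A loop like the one described above is usually prevented by tagging every
automatic bounce with a marker header and refusing to bounce anything that
already carries it, or that comes from a mail daemon. The following is a
minimal Python sketch of that check, not the user's actual procmail rules.
The marker value and the function name are hypothetical, and it assumes the
bounce itself is sent with an X-Loop header containing the same marker.

```python
from email import message_from_string
from email.utils import parseaddr

# Hypothetical marker; the outgoing bounce is assumed to carry
# "X-Loop: spam-bouncer@example.invalid".
LOOP_TAG = "spam-bouncer@example.invalid"


def safe_to_bounce(raw_message: str) -> bool:
    """Return True only if bouncing this message cannot start a mail loop."""
    msg = message_from_string(raw_message)

    # The message already passed through this filter once.
    if any(LOOP_TAG in value for value in msg.get_all("X-Loop", [])):
        return False

    # Never auto-reply to bounces or daemon mail (or to a missing sender).
    _, sender = parseaddr(msg.get("From", ""))
    local_part = sender.split("@")[0].lower()
    if local_part in ("mailer-daemon", "postmaster", ""):
        return False

    return True
```

With a guard of this kind, a bounce that comes back as undeliverable is
dropped instead of being classified as spam and bounced again.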
As with the last time a mail loop ran the spool out of space, the end
result was that some mailboxes were corrupted, with about 540 MB of nulls
added to them. This caused massive problems when someone tried to access
their mail, both on the mail server and on eskimo. When someone accesses
their mail on the mail server, the POP server has to copy their mailbox
(including the 540 megabytes of nulls), and this pretty much ties up that
machine. If someone accesses their mail from eskimo, it reads that 540 MB
mailbox across the network, and despite both machines having FDDI
interfaces, this quickly ties up both machines.
A bug exists in SparcLinux NFS: sometimes, when an NFS request times
out, NFS just stops talking and won't talk again until the NFS daemon is
restarted, or sometimes until the machine is rebooted. This is what
killed the main web server earlier today.
Nobody called about the web server for a number of hours, and I didn't
notice it because I was busy fixing the mail problems. No mailbox files
were lost this time, though some are still being processed to remove the
nulls.
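Removing the null padding from a mailbox can be sketched as a streaming
copy that drops NUL bytes. This is an illustration under assumptions, not
the procedure actually used here: the function name is hypothetical, the
mailbox is assumed to be locked against delivery while it runs, and
legitimate mail text is assumed to contain no NUL bytes.

```python
def strip_nuls(src_path, dst_path, chunk_size=1 << 20):
    """Copy src_path to dst_path without NUL bytes; return how many were dropped."""
    removed = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            # Read in fixed-size chunks so a 540 MB file never sits in memory.
            chunk = src.read(chunk_size)
            if not chunk:
                break
            cleaned = chunk.replace(b"\x00", b"")
            removed += len(chunk) - len(cleaned)
            dst.write(cleaned)
    return removed
```

Writing to a separate file and renaming it over the original afterwards
keeps the damaged mailbox intact if the copy is interrupted.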
I had to take POP-3 down earlier in order to fix this, which is why it
was refusing connections for a while.
A long-term solution is still in the works: it will put the mail spool,
user files, and FTP files on one file server so that quotas can be
enforced on the mail spool, which will keep one user's errant process from
filling up the entire spool.
Jimmie and I are also working on an alarm scheme for various services
here so that we will become aware of failures sooner.
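One simple form such an alarm could take is a periodic check that each
service still accepts TCP connections. The sketch below is only an
illustration of that idea, not the scheme being built here; the function
names are hypothetical, and the hosts and ports passed in are whatever the
operator chooses to monitor.

```python
import socket


def service_is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def failed_services(checks):
    """checks maps a service name to (host, port); return the names that are down."""
    return [name for name, (host, port) in checks.items()
            if not service_is_up(host, port)]
```

A connection test like this only shows that a port answers. It would catch
a POP server refusing connections, but a hung NFS mount behind a web server
that still accepts connections would need a check that fetches a page.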