Eskinews
- To: outages-list@eskimo.com
- Subject: Eskinews
- From: Robert Dinse <nanook@eskimo.com>
- Date: Tue, 10 Feb 1998 00:51:32 -0800 (PST)
- Resent-Date: Tue, 10 Feb 1998 00:51:41 -0800
- Resent-From: outages-list@eskimo.com
- Resent-Message-ID: <"BF89i.0.BR7.SK1uq"@mx1>
- Resent-Sender: outages-list-request@eskimo.com
The non-technical version:
Eskinews has flaws on one of its disks, and the utilities provided in
Linux for dealing with them don't work. The lengthy downtime was the
result of these errors and of my various attempts to lock out the bad
portions of the disk with those utilities. Because it is not
possible to lock out the bad parts of the disk, it is going to be necessary
to replace the disk, and we will lose the current news on the server as a
result.
The more technical version:
One of the spool drives on Eskinews has about 16 media flaws. If
this had been running SunOS it wouldn't have been a major problem: fire up
format, lock out the bad blocks, and life goes on. But SunOS supports a
maximum partition size of 2gb, which is insufficient for News, and does not
support disk striping, which is necessary to provide adequate disk I/O
capacity for Usenet News processing at current volumes.
Linux, rather than using the drive's bad block list, assigns an inode
for bad blocks and basically creates an invisible file of bad blocks.
Where this scheme fails miserably is if those blocks happen to be inode
blocks instead of file storage blocks. If the blocks are inodes, rather
than relocate them, or simply mark those inodes as unusable, the utility
used to map out bad blocks and fix the file system (e2fsck) finds that it
no longer has room for the inodes, freaks, and starts over from the
beginning, and it will go around this loop indefinitely. So the
bottom line is that, with SCSI-II disks, there is no way to lock out a bad
block under Linux. A SCSI-III disk maintains a bad block list, reserves
alternate tracks, and handles bad block reassignments internally,
but the disks we presently have for spool drives do not.
I'm going to have to replace this drive on Eskinews. When we added a
spool drive, just restoring 3gb of news spool files took three days. At
the current rate of more than 2gb of news/day, restoring 5 days' worth of
news would take well over a week, so there is no point in doing
it. The spool will have to be zeroed in the process of replacing the
drive.
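
For anyone who wants to check the arithmetic behind that estimate, here is
a rough sketch. It assumes the restore would run at the same rate we saw
last time (3gb in three days) and treats the backlog as exactly 5 days at
2gb/day, which is the low end since the feed is more than that. The
function name is only for illustration.

```python
# Rough restore-time estimate from the figures quoted in this message.
# Assumptions: the restore rate matches the earlier restore (3 GB in 3 days),
# and the backlog is exactly 5 days of news at 2 GB/day (a lower bound).

def restore_days(backlog_gb, restored_gb, elapsed_days):
    """Days needed to restore backlog_gb at the previously observed rate."""
    rate_gb_per_day = restored_gb / elapsed_days
    return backlog_gb / rate_gb_per_day

backlog_gb = 5 * 2.0                       # 5 days of news at 2 GB/day
days = restore_days(backlog_gb, 3.0, 3.0)  # earlier restore: 3 GB in 3 days
print(f"{backlog_gb:.0f} GB backlog, roughly {days:.0f} days to restore")
```

At about 1gb/day of restore throughput, the restore would take longer than
the period of news it covers, and the real feed is heavier than the figure
used here.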
It took me as long as it did to arrive at the conclusion that I
couldn't lock out these blocks because each iteration of fsck on the spool
partition takes about an hour owing to the size of that partition (17gb)
and the number of directories and files (more than a million).
So at this point I am going to pick up a Western Digital replacement
that is SCSI-III and maintains its own bad block list, and swap out
the existing drive as soon as I can get the new one.
There is some additional instability involving INND just
spontaneously dying, and I think that is related to the recent kernel
upgrade because it started right after we upgraded the kernel. I have put
the machine back on the old drive to test the theory.