This message is being posted to a couple of the larger mailing lists on
hyperreal to which I am subscribed - if you are a list admin and want to
post it to your list, then feel free, but there's no need to start a
whole lot of discussion about it in public - please send any comments to me.
Many of you have probably noted problems on hyperreal recently - the
memory upgrade helped address some of the serious server performance
problems, but it did nothing for bandwidth, and it also did not fix a
problem with majordomo and mailing list delivery: blank messages,
multiple messages, missing messages, etc. To compound
problems, I was out of town for most of this week, and did not get to see
most of the problems while they were happening. So, here is a synopsis
of the current state of things, and how they are being addressed.
Bandwidth
Right now, as most of you know, Hyperreal shares two T1s to Sprintlink
with: Organic, HotWired, Wired, Suck, BiancaTroll, bigbook.com, apache.org,
and a bunch of other sites. Up until this last week the bandwidth
picture wasn't pretty, but it was "alright" - at peaks during the day the
link to Sprintlink would be 75% loaded, and many times Sprintlink
themselves would be flaky if not hosed. But it was manageable. Within
the last week, though, two things have happened: www.levi.com, run by
Organic, made it to Netscape's "what's cool!" page, and with its massive
server-pushes, screamed to the tune of gigabytes per day over the
connection. Secondly, bigbook.com, an Organic-related company, launched
a yellow-pages service with a lot of press, and it has also been drawing a
large number of hits. So, the T1s to the building are now maxed - for a
while earlier today each T1 was delivering 1.49 Mbits/sec, against the
theoretical max of a T1 at 1.54 Mbits/sec. Ugly!
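For anyone who wants to check the arithmetic on those numbers, here is a rough sketch in Python. It assumes only the two figures quoted above (1.49 Mbits/sec observed, 1.54 Mbits/sec theoretical maximum); the function names are made up for illustration.

```python
# Rough link-utilization arithmetic for the figures quoted above.
# The function names are hypothetical; only the two numbers come from the message.
T1_CAPACITY_MBITS = 1.54  # theoretical T1 maximum, as quoted


def utilization(observed_mbits, capacity_mbits=T1_CAPACITY_MBITS):
    """Return the fraction of link capacity in use (0.0 to 1.0)."""
    if capacity_mbits <= 0:
        raise ValueError("capacity must be positive")
    return observed_mbits / capacity_mbits


def headroom_kbits(observed_mbits, capacity_mbits=T1_CAPACITY_MBITS):
    """Return the unused capacity in kbits/sec."""
    return (capacity_mbits - observed_mbits) * 1000


if __name__ == "__main__":
    load = utilization(1.49)
    spare = headroom_kbits(1.49)
    print(f"{load:.1%} loaded, about {spare:.0f} kbits/sec spare per T1")
```

That works out to roughly 97% loaded, with only about 50 kbits/sec of headroom on each T1, compared with the 75% peaks from before.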
The solution: Organic is getting its two T1s in about two weeks. This
has been in the planning stages for several months, which is the average
time frame for getting new bandwidth. We are also getting two more T1s
later to a different provider, allowing us some redundancy. This will
happen soon, but not immediately. In the short term, the impact from
Levi's is being turned down, but there will still be some significant lag
to the system, so I recommend that people avoid even trying to read mail
or do any other type of interactive communication with hyperreal.
Cruising the web site is alright, but trying to write an email in pine is
just impossible. I know, I was trying to do that from the IETF meeting. :)
If the situation remains critical, I may take action such
as turning off the mail daemon on hyperreal in the middle of the day, or
turning off immediate delivery for mailing lists, and then turning it on
at night for later delivery. These are drastic but temporary measures.
Hell, I'm pushing for a T3. :)
Mailing list problems
Most of hyperreal's workload is driven by the number of mail
deliveries it does on a given day. On a slow day hyperreal delivers 20K
messages. On a busy day it delivers well over 100K. This punishes the
operating system at a pretty deep level - and the problem I've seen is
that sendmail can't pass the message off to the program that needs to
handle it, generating an error message of "Cannot fork()". It has been
difficult to track down the exact reason why it can't fork, and while I
think I have finally nailed it down today (for the techies out there:
CHILD_MAX in the BSDI kernel config, which I think the "daemon" user is
exceeding), I'm still not sure. I will be upgrading to 2.1 this weekend, and
finding more swap space, but this is an ongoing battle I thought I had
solved when I bought more memory two weeks ago.
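For the techies, here is a minimal sketch in Python of what a "cannot fork" failure looks like from a program's side. On POSIX systems, fork() fails with EAGAIN when a process limit such as CHILD_MAX has been reached, and a caller can back off and retry rather than drop the work. This is only an illustration of the general technique, not what sendmail or majordomo actually does; the function name, retry count, and delays are assumptions.

```python
import errno
import os
import time


def fork_with_retry(fork=None, attempts=5, delay=0.5, sleep=time.sleep):
    """Call fork(), retrying with growing delays while the OS reports EAGAIN.

    EAGAIN is the error fork() gives when a process limit has been hit.
    Any other error, or EAGAIN on the final attempt, is raised to the caller.
    The fork and sleep parameters exist so the logic can be exercised
    without creating real processes.
    """
    if fork is None:
        fork = os.fork  # resolved lazily; os.fork exists only on Unix-like systems
    for attempt in range(attempts):
        try:
            return fork()
        except OSError as exc:
            if exc.errno != errno.EAGAIN or attempt == attempts - 1:
                raise
            # Back off: 0.5s, 1s, 2s, ... with the default delay.
            sleep(delay * (2 ** attempt))
```

Retrying only hides a transient shortage. If the per-user limit is simply too low for the mail volume, the real fix is still raising the limit in the kernel config.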
Anyway, if you sent mail to hyperreal more than 12 hours ago and it has
not shown up on the list yet (and mail usually gets delivered pretty
quickly for you), then consider submitting it again; it may be lost.
Hopefully this kernel config has fixed it, but we'll find out next week,
since we usually don't hit this problem during the weekend.
That's it.
Brian