by Ron Herardian
©1995 Global System Services Corporation (GSS)
DIRTY NETWORKS:
Contrary to popular belief, data corruptions on networks
are commonplace. Further, network hardware and protocol stacks do not
necessarily detect errors.
Novell's IPX protocol, for example, contains no validation
mechanism by default. CRC checking can be enforced by the server but
in practice no one uses it because it noticeably slows down the server.
This means that hardware is the only thing between
the cc:Mail user agent and the post office. CRC errors, when detected,
are typically caused by a bad network card (when they have a single
point of origin) or by a cabling problem (when they are distributed
over the network).
Of course, we like to think that, because error detection
mechanisms are in place, undetected corruptions of data transmitted
across a network cannot occur. Unfortunately, this is just
not true. I suspect that an investigation into something as simple as
CRC checking in ISA network cards would yield alarming results, especially
when running in PCs with non-standard bus clocks (common in clone PCs)
and on cabling systems of various quality.
FAULT TOLERANT APPLICATIONS:
For most applications, minor data corruptions that
occur on the wire can go unnoticed indefinitely. So long as the data
on the server are intact, we are content to dismiss everyday errors
as attributable to "cosmic rays." For a cc:Mail database, however,
even a minor corruption can have disastrous consequences.
NFT checking is basically page-level CRC checking in
the NFTCHECK, CLANDATA and USR files. This is a high-level detection
mechanism that reveals lower-level problems. A CRC is written to each
page of the cc:Mail database. There are four possible warnings or errors:
an error prior to login; a warning or error reading; a warning or error
writing; and a warning or error verifying. When a warning or error occurs,
an entry is made in a file named NFTERROR.LOG stored in the post office
data directory.
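
The idea of a CRC written to each page can be illustrated with a
minimal sketch. The page size, the position of the CRC within the page,
and the use of CRC-32 are assumptions made for illustration only; the
actual cc:Mail page layout and CRC algorithm are not described here.

```python
import struct
import zlib

PAGE_SIZE = 512  # hypothetical page size, chosen only for this example


def seal_page(payload: bytes) -> bytes:
    """Return a page made of the payload followed by its 4-byte CRC-32."""
    if len(payload) != PAGE_SIZE - 4:
        raise ValueError("payload must fill the page exactly")
    return payload + struct.pack("<I", zlib.crc32(payload))


def page_is_intact(page: bytes) -> bool:
    """Recompute the CRC over the payload and compare it with the stored one."""
    payload = page[:-4]
    (stored,) = struct.unpack("<I", page[-4:])
    return zlib.crc32(payload) == stored


page = seal_page(b"A" * (PAGE_SIZE - 4))
# Flip a single bit to imitate a corruption introduced in transit.
damaged = bytes([page[0] ^ 0x01]) + page[1:]

print(page_is_intact(page), page_is_intact(damaged))
```

Because the check travels with the page, it catches damage introduced
anywhere between the server's disk and the workstation's memory, which
is what makes it a high-level detector of lower-level problems.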
ERRORS PRIOR TO LOGIN:
An error prior to login means that test data written
to and read back from the NFTCHECK file failed a CRC check. A cc:Mail
user agent detecting this error would not continue to access the post
office database because a corruption would likely result.
ERRORS READING:
An NFT warning reading means that a CRC embedded in
a given page of the cc:Mail database did not match the data received.
If the CRC does match after between 1 and 20 retries, a warning
is logged. If, after 20 attempts to read the data and check
the embedded CRC, the CRC still does not match, then an error reading
is reported. The point, of course, is that if the CRC is correct for
a given page within the database file itself, it should still be correct
when the data arrive at the workstation.
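
The distinction between a warning reading and an error reading can be
sketched as a retry loop. This is an illustration, not cc:Mail's code:
the function names are hypothetical, CRC-32 stands in for whatever CRC
cc:Mail uses, and the limit is assumed to mean 20 read attempts in total.

```python
import zlib

MAX_ATTEMPTS = 20  # retry limit described in the text (assumed total attempts)


def checked_read(fetch, expected_crc, max_attempts=MAX_ATTEMPTS):
    """Call fetch() until the data match expected_crc or attempts run out.

    Returns (status, attempts), where status is "ok", "warning" or "error".
    """
    for attempt in range(1, max_attempts + 1):
        if zlib.crc32(fetch()) == expected_crc:
            # A match on a later attempt means the stored page is good
            # but at least one transfer was corrupted: log a warning.
            return ("ok" if attempt == 1 else "warning", attempt)
    return ("error", max_attempts)


def flaky_link(good, bad_reads):
    """Simulate a link that corrupts the first bad_reads transfers."""
    remaining = [bad_reads]

    def fetch():
        if remaining[0] > 0:
            remaining[0] -= 1
            return b"\x00" + good[1:]  # first byte damaged in transit
        return good

    return fetch


good = b"page data"
crc = zlib.crc32(good)
print(checked_read(flaky_link(good, 0), crc))
print(checked_read(flaky_link(good, 2), crc))
print(checked_read(flaky_link(good, 25), crc))
```

A page that is corrupted on disk behaves like the last case: no number
of retries helps, because the embedded CRC can never match.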
The NFT error reading is often misunderstood because
when a cc:Mail database file is corrupted, user mail applications report
errors reading. This is because one or more CRCs do not match and will
never match so long as the corruption persists.
cc:Mail Support lore has it that NFT errors reading
are commonly associated with bad network cards and cabling problems.
In practice, the NFTERROR.LOG can be used to identify specifically which
cards are bad. It's not unusual for a cc:Mail database repair technician
to recommend replacing the network card of a particular user who
repeatedly logs NFT errors.
In other cases, it may be observed that all of the
users appearing in the NFTERROR.LOG file are users whose workstations
are attached to a particular LAN segment.
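
This kind of analysis amounts to counting log entries per user and per
LAN segment. The sketch below assumes the log has already been reduced
to a list of user names; the layout of NFTERROR.LOG is not described
here, and the user names and the user-to-segment inventory are
hypothetical.

```python
from collections import Counter

# Hypothetical entries: one user name per logged NFT warning or error.
logged_users = ["alice", "bob", "alice", "carol", "alice", "carol"]

# Hypothetical inventory mapping each user's workstation to a LAN segment.
segment_of = {"alice": "segment-A", "bob": "segment-B", "carol": "segment-A"}

by_user = Counter(logged_users)
by_segment = Counter(segment_of[user] for user in logged_users)

worst_user, user_hits = by_user.most_common(1)[0]
worst_segment, segment_hits = by_segment.most_common(1)[0]

print(f"most entries by user: {worst_user} ({user_hits})")
print(f"most entries by segment: {worst_segment} ({segment_hits})")
```

A single user dominating the count points toward that workstation's
network card; entries spread across one segment point toward cabling.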
NFT ERRORS WRITING:
An NFT error writing involves a read-back-and-compare
operation for data written to certain database files. If the compare
fails, then the data written to the database do not match the data in
memory at the workstation. This could be because a corruption has occurred
or because the data, although not corrupted within the database file,
are corrupted dynamically on each attempt to read-back-and-compare,
up to the point where the cc:Mail application ceases to retry (historically
20 attempts).
An error writing often indicates that the cc:Mail database
has been corrupted. However, it is also possible that the data in the
workstation's memory are corrupted, or that the data, though written
faithfully to the cc:Mail database on the server, are corrupted
dynamically on the network during every attempt to read them back and
compare them, or at least their CRC, to the data in memory.
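
A read-back-and-compare can be sketched as follows. The function names
and the in-memory "server" are hypothetical stand-ins; the sketch only
illustrates why a failed compare does not by itself prove that the
stored data are corrupted.

```python
def verified_write(write, read_back, data, max_attempts=20):
    """Write data, then read it back until it matches or attempts run out."""
    write(data)
    for _ in range(max_attempts):
        if read_back() == data:
            return True
    return False


store = {}  # stands in for the database file on the server


def write(data):
    store["page"] = bytes(data)


def clean_read():
    return store["page"]


def noisy_read():
    # The stored page is fine, but every transfer back is damaged in transit.
    page = store["page"]
    return bytes([page[0] ^ 0xFF]) + page[1:]


print(verified_write(write, clean_read, b"hello"))
print(verified_write(write, noisy_read, b"hello"))
print(store["page"] == b"hello")
```

In the second call the compare fails on every attempt even though the
stored page is intact, which is the "corrupted dynamically" case
described above.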
NFT ERRORS VERIFYING:
After a write operation, the data are read back so that the
data written to the cc:Mail database can be compared with the
data in memory. If this read-back operation itself fails, and
continues to fail after some number of attempts, an error verifying
is reported.
According to cc:Mail Support lore, NFT errors verifying
are often associated with bad disk media at the server and verify errors
of increasing frequency may indicate an impending server hard drive
failure.
According to cc:Mail, NFT errors are caused exclusively
by lower-level network errors and the cc:Mail applications are merely
detecting these errors. However, the cc:Mail applications are sometimes
the only indicator that a problem exists. Verification of these errors
depends upon independent confirmation through network analysis tools,
such as Network General Sniffer and Novell LANalyzer, that there are
in fact network errors.
IDENTIFYING THE REAL PROBLEM:
Often, because cc:Mail software is the only indicator
that there is a problem, it is believed that cc:Mail software is actually
the cause of the problem, or even the cause of other network errors
that correlate positively with the incidence of NFT errors. Proving
that cc:Mail is not at fault can be difficult. Knowing how NFT checking
works, however, together with experience in practice, suggests that
pointing to cc:Mail when there are network problems is a questionable
policy at best.
UNDERLYING NETWORK AND CC:MAIL SYSTEM DESIGN PROBLEMS:
When network bandwidth utilization is excessively high,
the probability of NFT errors is increased. Ethernet networks are
particularly susceptible when utilization regularly runs over 30%,
because CSMA/CD performance degrades as collisions increase. This problem can
become critical for cc:Mail when there are cc:Mail system design problems
or when there are underlying network design, capacity, or routing problems
that interact with the cc:Mail system.
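
Utilization here is simply the traffic observed over an interval
divided by what the link could have carried in that interval. The
sketch below assumes a 10 Mbit/s Ethernet segment and a byte count
taken from a network analyzer; the function name and the sample figures
are illustrative only.

```python
def utilization(bytes_seen, seconds, link_bits_per_second=10_000_000):
    """Fraction of link capacity used over the measurement interval."""
    return (bytes_seen * 8) / (seconds * link_bits_per_second)


THRESHOLD = 0.30  # the 30% figure cited in the text

# Example: 45,000,000 bytes observed in one minute on a 10 Mbit/s segment.
u = utilization(45_000_000, 60)
print(f"utilization: {u:.0%}, over threshold: {u > THRESHOLD}")
```

A single reading matters less than the trend: the concern raised above
is utilization that regularly exceeds the threshold.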
Common cc:Mail system design issues have to do with
the message routing topology, the configuration of cc:Mail Routers,
the deployment of post offices in a given LAN or WAN environment, and
so forth. Common network problems can involve bottlenecks caused by
network design itself, the deployment of servers in a given LAN or WAN
environment, the hardware and software configurations of network routers,
and so forth.
LONG-TERM SOLUTIONS:
Anyone who's been through a serious cc:Mail database
corruption wants a guarantee that it won't happen again. Identifying
and correcting the underlying problem, however, may require a review
of the cc:Mail system design and of the network design.
Long-term solutions might range from isolating building
or campus networks from segments that act as WAN backbones and upgrading
network routers, to moving post offices or adding new post offices or
file servers.
In well-designed network environments, NFT errors are
extremely rare or simply nonexistent. Experience indicates that the
more NFT errors there are, the more likely post office corruptions and
other cc:Mail problems become.