Disasters can take many forms and, by their very nature, are unexpected. If DNS and mail are to continue to work, expecting the unexpected is vital. The kinds of disasters that one must anticipate vary from the mundane to the catastrophic:
A reboot or scheduled down-time for dumps on the mail or DNS server should only cause mail to be delayed, not lost.
A failed component on the mail or DNS server could cause mail delivery to be delayed anywhere from a few hours to a few days. A delay of over three to five days could cause many hosts to bounce queued mail unless steps are taken to receive that mail elsewhere.
Natural disasters can disrupt site or network connectivity for weeks. The Loma Prieta earthquake on the West Coast of the United States lasted only a few minutes but knocked out electric power to many areas for far longer. Fear of gas leaks prevented repowering many buildings for up to two weeks. A hurricane, flood, fire, or even an errant backhoe could knock out your institution for weeks.
When mail can't be received, whether because of a small event or a large disaster, an offsite MX host can save the day. An offsite MX host is simply another machine that can receive mail for your site when your site is unavailable. The location of the offsite machine depends on your situation. For a subdomain at one end of a microwave link, having an offsite host on the other side of the microwave might be sufficient. For a large site, such as a university, a machine at another university (possibly in a different state or country) would be wise.
Before we show how to set up offsite MX hosts, note that offsite MX hosts are not an unmixed blessing. If an offsite MX host does not handle mail reliably, you could lose mail. In many cases it is better not to have an offsite MX host than to have an unreliable one. Without an MX site, mail will normally be queued on the sending host. A reliable MX backup is useful, but an unreliable one is a disaster.
You should not unilaterally select a host to function as an offsite MX host. To set up an offsite MX host, you need to negotiate with the managers of other sites. By mutual agreement, another site's manager will configure that other machine to accept mail bound for your site (possibly queueing weeks' worth of mail) and configure that site to forward that mail to yours when your site comes back up.
For example, suppose your site is in the state of Iowa, in the United States. Further suppose that in Northern Japan there is a site with which you are friendly. You could negotiate with that site's manager to receive and hold your mail in a disaster. When the site is set up to do so, you first add a high-cost MX record for it:
mailhost.uiowa.edu.   IN MX 2   mailhost.uiowa.edu.
mailhost.uiowa.edu.   IN MX 10  backup.uiowa.edu.
mailhost.uiowa.edu.   IN MX 900 pacific.north.jp.    ← note
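To see why a cost of 900 makes pacific.north.jp a last resort, here is a minimal Python sketch of how a sending host might order these three records. The `MXRecord` type and the `delivery_order` function are hypothetical names used only for illustration, and the sketch ignores details that real mailers handle, such as equal costs and loop avoidance.

```python
from typing import NamedTuple

class MXRecord(NamedTuple):
    preference: int   # the MX cost; lower values are tried first
    exchanger: str    # host that accepts mail for the domain

def delivery_order(records):
    """Return exchanger names in the order a sender would try them."""
    return [r.exchanger for r in sorted(records, key=lambda r: r.preference)]

# The three records from the text, deliberately listed out of order.
records = [
    MXRecord(900, "pacific.north.jp."),
    MXRecord(2, "mailhost.uiowa.edu."),
    MXRecord(10, "backup.uiowa.edu."),
]
order = delivery_order(records)
print(order)
```

Under these assumptions the disaster host is tried only after both local hosts have failed to accept the message.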
To be sure the MX works, send mail to yourself via that new MX site:
% mail you%mailhost.uiowa.edu@pacific.north.jp
Here, the @pacific.north.jp part of the address causes the message to be delivered first to pacific.north.jp. That machine then throws away its own name and converts the remaining % to an @. [16] The result is then mailed back to you at:
you@mailhost.uiowa.edu
[16] This example presumes that pacific.north.jp can handle the % "hack." Most places do, so this is probably a safe assumption.
This verifies that the disaster MX machine can get mail to your site when it returns to service.
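The rewriting that the relay performs on such a test address can be sketched in a few lines of Python. This is only an illustration of the % convention as described above, not sendmail's actual rule set; the `percent_hack` function and its `local_host` parameter are hypothetical names.

```python
def percent_hack(address, local_host):
    """Rewrite user%final@relay to user@final when relay is local_host."""
    local_part, at, domain = address.rpartition("@")
    if not at or domain.lower() != local_host.lower():
        return address        # not addressed to this host; leave untouched
    user, pct, next_host = local_part.rpartition("%")
    if not pct:
        return address        # no % in the local part; ordinary local delivery
    # Drop our own name and promote the rightmost % to an @.
    return f"{user}@{next_host}"

rewritten = percent_hack("you%mailhost.uiowa.edu@pacific.north.jp",
                         "pacific.north.jp")
print(rewritten)
```

Only the rightmost % is converted, so an address with several % characters would be passed along one hop at a time.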
During a disaster the first sign of trouble will be mail for your site suddenly appearing in the queue at pacific.north.jp. The manager there should notice and set up a separate queue to hold the incoming mail until your site returns to service (see Section 23.7.1, "Handling a Down Site"). When your site recovers, you can contact that manager and arrange for a queue run to deliver the backlog of mail.
If your site is out of service for weeks, the backlog of mail might be partly on tape or some other backup media. You might even want to negotiate an artificially slow feed so that your local spool directory won't overfill.
Even in minor disasters an MX host can save much grief because delivery will be serialized. Without an MX host, every machine in the world that had mail for your machine would try to send it at the same time, that is, when your machine returns to service. That could overload your machine and even crash it, causing the problem to repeat over and over.
A disaster MX is good only as long as your DNS services stay alive to advertise it. Most sites have multiple name server machines to balance the load of DNS lookups and to provide redundancy in case one fails. Unfortunately, few sites have offsite name servers as a hedge against disaster. Consider the disaster MX record developed above:
mailhost.uiowa.edu. IN MX 900 pacific.north.jp.
Ideally, one would want pacific.north.jp to queue all mail until the local site is back in service. Unfortunately, all DNS records contain a Time To Live (TTL) that may or may not be present in the declaration line:
mailhost.uiowa.edu.        IN MX 900 pacific.north.jp.    ← TTL implied
mailhost.uiowa.edu. 86400  IN MX 900 pacific.north.jp.    ← TTL specified as 24 hours in seconds
When other sites look up the local site, they cache this record for the length of its TTL, here 24 hours, and will not look it up again until that time has passed. Therefore, if an earthquake strikes and takes your name servers down along with your mail hosts, all other sites will forget about this record after 24 hours and will not be able to look it up again.
In general, records set up for disaster purposes should be given TTLs that are over a month:
mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.    ← TTL specified as 41 days in seconds
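Because TTLs are written in seconds, it is easy to misjudge how long a value really lasts. The following Python sketch converts the two TTLs used in this section into days; the `ttl_days` helper is a hypothetical name introduced only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86400

def ttl_days(ttl_seconds):
    """Convert a DNS TTL in seconds to days."""
    return ttl_seconds / SECONDS_PER_DAY

print(ttl_days(86400))               # 1.0
print(round(ttl_days(3600000), 1))   # 41.7
```

The long TTL therefore lasts a little over 41 days, comfortably more than a month.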
But note that the TTLs should be the same for all of the MX records so that they all time out at the same time:
mailhost.uiowa.edu. 3600000 IN MX 2   mailhost.uiowa.edu.
mailhost.uiowa.edu. 3600000 IN MX 10  backup.uiowa.edu.
mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.
If you gave the disaster record a long TTL and left the default for your normal MX records, your normal records would time out and disappear from other sites' caches. This would result in all mail suddenly and mysteriously going to the disaster host when there was no disaster to cause it.
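A mismatch of this kind is easy to overlook in a large zone file. As a rough sketch, the Python below checks whether a set of MX lines share one TTL. It assumes well-formed, one-record-per-line entries like those shown above, and the 86400-second default for lines with no explicit TTL is an assumption for illustration (a real zone's default comes from its own configuration). The names `mx_ttls` and `ttls_consistent` are hypothetical.

```python
def mx_ttls(zone_lines, default_ttl=86400):
    """Map each MX exchanger to its TTL; lines without a TTL get default_ttl."""
    ttls = {}
    for line in zone_lines:
        fields = line.split(";", 1)[0].split()   # drop any trailing comment
        if "MX" not in fields:
            continue
        mx_at = fields.index("MX")
        # An explicit TTL, if present, is a number between the owner and MX.
        explicit = [f for f in fields[1:mx_at] if f.isdigit()]
        ttl = int(explicit[0]) if explicit else default_ttl
        ttls[fields[mx_at + 2]] = ttl            # skip the cost, take the host
    return ttls

def ttls_consistent(zone_lines, default_ttl=86400):
    """True when every MX record found carries the same TTL."""
    return len(set(mx_ttls(zone_lines, default_ttl).values())) <= 1

mixed = [
    "mailhost.uiowa.edu. IN MX 2 mailhost.uiowa.edu.",
    "mailhost.uiowa.edu. IN MX 10 backup.uiowa.edu.",
    "mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.",
]
uniform = [
    "mailhost.uiowa.edu. 3600000 IN MX 2 mailhost.uiowa.edu.",
    "mailhost.uiowa.edu. 3600000 IN MX 10 backup.uiowa.edu.",
    "mailhost.uiowa.edu. 3600000 IN MX 900 pacific.north.jp.",
]
print(ttls_consistent(mixed), ttls_consistent(uniform))
```

The first set mixes a default TTL with a long one and is flagged as inconsistent; the second set passes.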
Note that long TTLs can make updates to your DNS files awkward. Updates won't take effect at other sites until the old TTL times out. If you anticipate a future change, say a rearrangement of your MX records, you can change the TTLs to 2 hours, wait for the old long TTL to time out (about 41 days in the example above), then make and test your changes. [17]
[17] And hope that no disaster strikes in the meanwhile. A better technique is to set up an offsite secondary DNS server with a large TTL in the SOA record.
If many hosts at your site receive mail (rather than a central mail server), it is necessary to add a disaster record for each. Unfortunately, when the number of such hosts at your site is greater than 100 or so, individual disaster MX records become difficult to manage simply because of scale.
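For a modest number of hosts, the per-host records can at least be generated rather than typed by hand. The Python sketch below does this under stated assumptions: the host names `cs.uiowa.edu` and `math.uiowa.edu` are invented for the example, and `fqdn` and `disaster_mx_records` are hypothetical helper names. The cost of 900 and the TTL of 3600000 seconds are taken from the records shown earlier.

```python
def fqdn(name):
    """Ensure a domain name ends with the trailing dot used in zone files."""
    return name if name.endswith(".") else name + "."

def disaster_mx_records(hosts, backup, ttl=3600000, preference=900):
    """Build one disaster MX line per mail-receiving host."""
    return [
        f"{fqdn(host)} {ttl} IN MX {preference} {fqdn(backup)}"
        for host in hosts
    ]

# Hypothetical hosts at the local site.
hosts = ["cs.uiowa.edu", "math.uiowa.edu"]
lines = disaster_mx_records(hosts, "pacific.north.jp")
for line in lines:
    print(line)
```

Each host's ordinary MX records would still need the same TTL as its generated disaster record, for the reason given above.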
At such sites, a better method of disaster preparedness is to set up pacific.north.jp as another primary DNS server for the local site. There are two advantages to this "authoritative" backup server approach:
An offsite primary server eliminates the need to set up individual MX disaster records.
An out-of-country primary server can lower the network impact of DNS lookups of your site.
Unfortunately, setting up an offsite or out-of-country server can be extremely difficult. We won't show you how to do that here. Instead, we refer you to the book DNS and BIND by Paul Albitz and Cricket Liu (O'Reilly & Associates, 2nd edition, 1997).