ArsDigita Server Architecture Auditing

by Jin Choi, part of ArsDigita Server Architecture

This documents an audit procedure for a single physical server running the ArsDigita Server Architecture.

There are several systems involved in each installation:

Unix
Oracle
AOLserver
ArsDigita software

Unix

This audit is independent of the day-to-day monitoring performed by ArsDigita Cassandrix, as specified in the ArsDigita Server Architecture.

Check that sufficient disk space is available using "df -k". Any partition close to 100% is bad, and should be brought back down at once.
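
The disk space check can be partly automated. The following is a minimal sketch, not part of the existing tooling: the function name full_partitions and the 90% default threshold are assumptions, and it expects Solaris-style "df -k" columns and a POSIX awk (nawk on older Solaris).

```shell
# Print every filesystem whose usage is at or above a threshold.
# Reads "df -k" style output on stdin. The 90% default is an assumed value.
full_partitions() {
    threshold="${1:-90}"
    awk -v limit="$threshold" 'NR > 1 {
        # Scan for the percentage column rather than assuming its position.
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^[0-9]+%$/) {
                pct = $i
                sub(/%/, "", pct)
                if (pct + 0 >= limit) print $NF, $i
                break
            }
        }
    }'
}

# Typical use: df -k | full_partitions 90
```

Each output line gives a mount point and its usage, so an empty result means no partition has reached the threshold.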

On arsdigita.com, there is a script /usr/local/bin/check-on-stuff.pl that runs a set of ad-hoc Unix checks. We may wish to put a variant of this on all our current systems.

Backups

We do not currently have a universal system for doing backups. How they are carried out depends on which version of Unix the machine is running, what kind of tape drive it has, and whether it was set up before or after TechSquare started taking care of backups. To find out whether backups are being carried out properly, su to root, run crontab -l, pick out the line that looks like it handles system backups, and read the script it names to see what software it uses and what (if any) logging is being carried out.

If there is currently a tape in the drive, you can check the result of the backup directly. Most of our systems rely on a variant of "dump" for backups. To check out a backup on tape, you must:

Find out which device file represents the tape drive. On Solaris, this is generally /dev/rmt/0n, on HPUX, /dev/rmt/0mnb; check the backup script from crontab -l to see which one it uses. Set the TAPE environment variable to this device.
Use "mt" to rewind the tape: "mt rewind" on Solaris, "mt rew" on HP.
Use "restore" in interactive mode to poke around the tape: "restore if <tape-device>". The "restore" command might be ufsrestore (Solaris) or vxrestore (HP).
Try restoring a file. Make sure you are in a scratch directory of some sort, then use "add <filename>" to mark a file for restoration and "extract" to recover all marked files. Marking a directory will recursively recover the directory.
To check any partition except for the first, you will need to use "mt fsf" to fast forward the tape. Check the backup script to see what order the partitions are dumped in.
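
The tape steps above can be collected into a short checklist generator. This is only a sketch: the function name tape_check_commands is hypothetical, and the device names are the typical defaults quoted above, which may not match what a given backup script uses. It prints the commands rather than running them, so nothing touches the tape until you have reviewed the output.

```shell
# Print the commands for spot-checking a dump tape, following the steps above.
# Nothing is executed. Run the printed commands by hand, as root, from a
# scratch directory. Device names are the usual defaults, not guaranteed.
tape_check_commands() {
    os="$1"          # "solaris" or "hpux"
    skip="${2:-0}"   # number of dumps to skip before the one to inspect
    case "$os" in
        solaris) tape=/dev/rmt/0n;   rewind_cmd=rewind; restore_cmd=ufsrestore ;;
        hpux)    tape=/dev/rmt/0mnb; rewind_cmd=rew;    restore_cmd=vxrestore ;;
        *) echo "unknown OS: $os" >&2; return 1 ;;
    esac
    echo "TAPE=$tape; export TAPE"
    echo "mt $rewind_cmd"
    if [ "$skip" -gt 0 ]; then
        echo "mt fsf $skip"
    fi
    echo "$restore_cmd if $tape"
}

# Typical use: tape_check_commands solaris 1
```

Passing 1 as the second argument produces the sequence for inspecting the second partition on the tape.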

Oracle

Make sure Oracle is running and that we aren't bumping up against process limits. Try connecting to Oracle using sqlplus.
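
A quick way to see which instances are up before trying sqlplus is to look for each instance's process monitor. The sketch below assumes the usual ora_pmon_<SID> naming of that background process. The function name oracle_instances is hypothetical. It does not check process limits, which still requires a query inside sqlplus.

```shell
# List the SIDs of running Oracle instances, given "ps -ef" output on stdin.
# Relies on the process monitor being named ora_pmon_<SID>.
oracle_instances() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^ora_pmon_/) {
                sid = $i
                sub(/^ora_pmon_/, "", sid)
                print sid
            }
        }
    }'
}

# Typical use: ps -ef | oracle_instances
```

An empty result suggests no instance is running on the machine.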

Backups

Oracle backups are handled by doing consistent exports every evening. The location and times of these dumps differ from machine to machine. To find where the exports are going, run "crontab -l" as root to find the script which does the exports. Make sure that script is using the proper Oracle system password. The latest export files in the export directory should be timestamped no earlier than the previous night. There should be enough free disk space to hold two copies of the latest exports, so that the next exports can happen (room for many more than two copies if the exports are compressed).
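
The timestamp check on the export directory can be scripted. This is a minimal sketch under the assumption that "fresh" means modified within the last 24 hours. The function names are hypothetical, and the export directory must be supplied by the caller since it differs from machine to machine.

```shell
# List files in the export directory modified within the last 24 hours.
recent_exports() {
    find "$1" -type f -mtime -1 -print
}

# Succeed only if at least one recent export file exists.
exports_are_fresh() {
    [ -n "$(recent_exports "$1")" ]
}

# Typical use: exports_are_fresh <export-directory> || echo "exports are stale"
```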

As these exports are done from root's cron, cron will send the results of the export to root. If you would like to receive these nightly mailings, add yourself as one of the recipients of the "root" mailing account (if running qmail, add your email address to /var/qmail/alias/.qmail-root).

Mail

Most of our newer systems run qmail as their mailer. If you want to figure out how to route mail around using qmail, read this.

To make sure qmail is functioning properly:

Check that the SMTP listener is running using "telnet localhost 25". Try sending yourself a message through the listener by typing in the following sequence:
```
helo
mail from: <some email address>
rcpt to: <your email address>
data
(type some stuff ended by "." on a line by itself)
quit
```
Check that mail is being delivered properly. "ps -ef | grep qmail" should show a number of mail processes running, especially qmail-send. Your test message should have gotten through.
If the qmail-send process is not running, qmail has died. The most common cause of this is running out of disk space. To check if the machine ran out of disk space recently, look in /var/adm/messages for "space".
Try running "/var/qmail/bin/qmail-qstat" as root. There may be a large number of messages in the queue, but there should only be a very small number of messages in queue but not yet preprocessed.
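
The queue check can be reduced to a pass/fail test by parsing the two lines that qmail-qstat prints. This is a sketch only: the function names and the default limit of 10 unprocessed messages are assumptions, so pick a limit that suits the machine.

```shell
# Extract the "not yet preprocessed" count from qmail-qstat output on stdin.
unprocessed_count() {
    awk -F': ' '/not yet preprocessed/ { print $2 }'
}

# Succeed if the unprocessed count is at or below the limit (assumed default 10).
queue_is_healthy() {
    limit="${1:-10}"
    count=$(unprocessed_count)
    [ -n "$count" ] && [ "$count" -le "$limit" ]
}

# Typical use: /var/qmail/bin/qmail-qstat | queue_is_healthy 10
```

A large total queue does not fail this check. Only a backlog of unprocessed messages does, which matches the guidance above.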

AOLservers

Grep for nsd in /etc/inittab to see what servers are supposed to be running. Make sure all of those servers are indeed running using "ps -ef | grep .ini".
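
Comparing the two listings by eye is error-prone when a machine hosts many servers. The sketch below assumes each nsd line in inittab names its configuration file with a path ending in .ini, and that the full path shows up in the ps listing (some systems truncate long command lines). The function names are hypothetical.

```shell
# List the .ini files named on uncommented nsd lines of an inittab-style file.
expected_inis() {
    grep nsd "$1" | grep -v '^#' | tr ' \t' '\n\n' | grep '\.ini$'
}

# Print each expected .ini that does not appear in a saved "ps -ef" listing.
missing_servers() {
    inittab="$1"
    pslist="$2"
    expected_inis "$inittab" | while read -r ini; do
        grep -F -q "$ini" "$pslist" || echo "$ini"
    done
}

# Typical use:
#   ps -ef > /tmp/ps.out
#   missing_servers /etc/inittab /tmp/ps.out
```

Any path printed belongs to a server that inittab expects but that is not currently running.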

ArsDigita services

We have three monitoring services that each run as their own process on various servers: keepalive, rollover, and reporte.

Keepalive: check that it is actually polling the other servers by grepping each server's access log for hits on /SYSTEM/dbtest.tcl.
Reporte: check that it is actually generating a report for each day by visiting each server's reports and visually verifying that they look reasonable and that none are missing.
rollover: see below
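
The keepalive check amounts to counting hits in an access log. A minimal sketch follows. The function name keepalive_hits is hypothetical, and the access log path must be supplied because it varies by server.

```shell
# Count requests for the keepalive test page in one access log.
keepalive_hits() {
    grep -F -c '/SYSTEM/dbtest.tcl' "$1"
}

# Typical use: keepalive_hits <access-log>
```

A count of zero on a server that keepalive is supposed to watch means keepalive is not reaching it.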

Logs

Logs tend to grow without bound unless checked. Various logs to check to make sure they aren't getting ludicrously big:

AOLserver error logs: "ls -l /home/nsadmin/log/*-error.log". If one of them is unusually large, make sure it is getting rolled (generally by the rollover service, sometimes by the keepalive service). To roll by hand, remove the file, then restart the AOLserver that generates it.
Email logs: Usually /var/log/syslog. Might have been put somewhere else; check /etc/syslog.conf. Sometimes we just turn it off entirely, because it is so voluminous. To roll, remove the file, touch that file to recreate it (might not be necessary), and kill -HUP the syslogd process.
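
Oversized error logs can be found without reading through "ls -l" output. This sketch assumes the *-error.log naming shown above. The function name big_logs and the 100 MB default are assumptions. It relies on find measuring -size in 512-byte blocks.

```shell
# List *-error.log files larger than a size given in kilobytes.
# find -size counts 512-byte blocks, so the KB figure is doubled.
# The default of 102400 KB (100 MB) is an assumed threshold.
big_logs() {
    dir="$1"
    kb="${2:-102400}"
    find "$dir" -type f -name '*-error.log' -size +"$((kb * 2))" -print
}

# Typical use: big_logs /home/nsadmin/log 102400
```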

jsc@arsdigita.com
