Installing the Monitors for the ACS system
by Hiro Iwashima and Ryan Lee
Installing all of the standard monitors for your system can be a configuration hassle: a small mistake can take hours to fix, and the process is not very friendly to the user. We have therefore tried to simplify the documents and show, step by step, what is needed for a completely monitored system and a happy sysadmin.
Keepalive
Keepalive original docs
Keepalive makes sure that your server can be accessed regularly. If it can't, it'll perform specified actions.
- Grab the tar file from the ArsDigita Download page
- untar it and move the directory to /web/keepalive
- change the ownership of the directory to nsadmin.
chown -R nsadmin.nsadmin /web/keepalive
- make a keepalive directory under /home/aol30/servers by copying an existing server's directory. From inside /home/aol30/servers, run
cp -pR [existing server] keepalive
- use a text editor to make /home/aol30/keepalive.ini or grab it from www.arsdigita.com/install/keepalive.ini. Set the correct address and hostname under [ns/server/keepalive/module/nssock]. Make sure to remove the line concerning nsssl if you don't have it. It's the last line in the [ns/server/keepalive/modules] section. If you are copying it from another ini file, make sure you have the following.
[ns/server/keepalive/modules]
nslog=nslog.so
nssock=nssock.so
nsperm=nsperm.so
Also, make sure the keepalive.ini file points to the correct log locations. The file currently available at ArsDigita.com keeps logs in /home/nsadmin/log, whereas our installation puts logs in /home/aol30/log, so replace every occurrence of nsadmin (except User=nsadmin) with aol30.
- Make sure you have a restart-aolserver script in your /home/aol30/bin directory. If you don't have it, it's at the bottom of the page. Also make sure that /home/aol30/bin is in the path.
- Edit /web/keepalive/tcl/defs.tcl (update the parameters) and /web/keepalive/tcl/init.tcl (keepalive_init procedure). In init.tcl, add monitors in the same way as the sample. The arguments are, in order:
- name
- URL of test page
- expected return
- shell command to execute if failure: (restart-aolserver yourservername)
- Tcl list of admin email addresses to notify
- Tcl list of pager email addresses to notify
- (optional) number of retries before failure action is executed. This defaults to 5.
- (optional) threshold of retries below which email is sent. This defaults to the number of retries, meaning that Keepalive will send mail if there is any problem (if you feel that you're getting spammed about problems that work themselves out, set this to some lower number; we find that 4 and 2 are good numbers)
To make sure restart-aolserver can be found, add /home/aol30/bin to your path in one of the startup scripts, for example /etc/profile. Then re-read the script by running this at the prompt: . /etc/profile
- copy /web/yourservername/tcl/ad-utilities.tcl.preload into /web/keepalive/tcl/ad-utilities.tcl.preload and /web/yourservername/tcl/00-ad-preload.tcl into /web/keepalive/tcl/00-ad-preload.tcl. It will create a few error messages in your error log because it doesn't find some of the preload files that are in your server installation, but it doesn't really matter.
- insert keepalive into /etc/inittab to make sure it respawns.
nska:34:respawn:/home/aol30/bin/nsd -ic /home/aol30/keepalive.ini
- Now, you can start the process by typing
/home/aol30/bin/nsd -i -c /home/aol30/keepalive.ini
- To make sure it is working, go to /web/yourservername/www/SYSTEM/ and run
mv dbtest.tcl dbtest.tcl.moved
This should make Keepalive's check of your server fail, and Keepalive should send you email. Move the file back when you are done.
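The log-path replacement described for keepalive.ini above can be done mechanically rather than by hand. The following is only a minimal sketch: it assumes the User=nsadmin line starts at the beginning of a line, and rewrite_log_paths is a hypothetical helper name, not something shipped with Keepalive.

```shell
# Rewrite nsadmin -> aol30 in an ini file read from stdin.
# Any line that begins with "User=" is left untouched, so User=nsadmin survives.
# (rewrite_log_paths is a hypothetical name used only for this sketch.)
rewrite_log_paths() {
    sed '/^User=/!s/nsadmin/aol30/g'
}

# Example (write to a new file; do not redirect onto the file being read):
#   rewrite_log_paths < keepalive.ini.orig > /home/aol30/keepalive.ini
```

Review the result before starting the server, since the filter replaces every other occurrence of nsadmin, wanted or not.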
Uptime
Uptime original docs
Uptime will make sure that your web server is up and running by checking it at designated intervals and performing the specified actions on it.
Sign up for Uptime
If the machine on which your service runs is down, the keepalive service on your machine will be down as well. Uptime resides on a separate server and sends alerts when your server can not be reached. You should use the forms at Uptime to register alerts to the following:
- All the people involved with your service
- noc@arsdigita.com
You should break your monitoring page to make sure Uptime sends an alert. Then return the page to normal.
Watchdog
Watchdog original docs
Watchdog checks your error logs at designated intervals and emails any errors it finds to the addresses you specify.
- Grab the tarfile at ArsDigita Download
- Untar it into /web/watchdog
- change the ownership of the directory to nsadmin.
chown -R nsadmin.nsadmin /web/watchdog
- grab the ini file from www.arsdigita.com/install/watchdog.ini and put it in /home/aol30
- modify the ini file
- make the server directory under /home/aol30/servers/watchdog (similar to keepalive)
- insert watchdog into the /etc/inittab
nswd:34:respawn:/home/aol30/bin/nsd -ic /home/aol30/watchdog.ini
- Go to http://yourserver:1998/ to add your server to the list.
- Create some Tcl errors and make sure email is sent. The email goes to the administrator unless another address is specified in the /web/yourserver/parameters/yourserver.ini file under [ns/server/emp530/monitoring]
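After editing /etc/inittab for Keepalive and Watchdog, it is easy to lose track of which respawn entries are actually in place. The sketch below checks for an entry by its id (the first colon-separated field, such as nska or nswd from the lines shown above). inittab_has_id is a hypothetical helper name, and the second argument exists only so the function can be pointed at a copy of the file.

```shell
# Print "present" or "missing" for an inittab id (first colon-separated field).
# $1 = id to look for, $2 = file to inspect (defaults to /etc/inittab).
# (inittab_has_id is a hypothetical name used only for this sketch.)
inittab_has_id() {
    id=$1
    file=${2:-/etc/inittab}
    if grep -q "^${id}:" "$file" 2>/dev/null; then
        echo present
    else
        echo missing
    fi
}
```

For example, inittab_has_id nswd reports whether the Watchdog line has been added. The function only reads the file; init still has to be told to re-read it (init q) after any change.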
Cassandrix
Cassandrix original docs
Cassandrix makes sure that you have enough disk space on your hard drive. If it starts to run out, it will send email alerts.
- Grab the tarfile at ArsDigita Download
- untar it somewhere
- change the ownership of the directory to yourself.
chown -R yourself.yourself whatever
- Target machines:
- copy the files in the Cassandrix SYSTEM directory into /web/yourservername/www/SYSTEM directory.
- Master Machines:
- copy the files in the Cassandrix tcl directory to the server's private Tcl library. Currently, there's only cx-defs.tcl
- copy the Cassandrix directory into /web/yourservername/www/cassandrix
- Make sure adp pages are enabled. In your nsd.tcl or nsd.ini in /home/aol30, make sure you have this:
[ns/server/markd/adp]
Map=/*.adp
- feed cassandrix.sql into Oracle.
sqlplus orauser/orapassword < cassandrix.sql
- Restart your aolserver:
restart-aolserver
- go to http://yourservername/cassandrix/index.adp and tell it which machines to monitor
- Host Name : the name of the host to be monitored. This is just used for putting a name with links on the various pages, and doesn't have to be a fully-qualified domain name.
- base URL : the base URL from which to construct the /SYSTEM/* URLs ...
- custom email subject : ... through pager gateway. It's best to make this a generic subject, since (if supplied) it will be used as the subject for all alerts, including the "everything's OK" alert.
- custom email body : specialized email body to use on outgoing mail. Like the custom email subject, this is used for all alerts ... filesystems that are full are appended to the body.
- notification interval : how often to send mail complaining that disks are full. It doesn't make much sense to set this to be less than the monitor interval.
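When testing the alerts it helps to know independently which filesystems are nearly full. The sketch below is not how Cassandrix itself works; it is just a manual cross-check that reads POSIX-format df output (df -P) and lists mount points at or above a threshold. The function name full_filesystems and the default threshold of 90 percent are assumptions made for this example, and mount points containing spaces are not handled.

```shell
# Read `df -P` style output on stdin and print "<mount point> <percent used>"
# for every filesystem at or above the threshold given as $1 (default 90).
# (full_filesystems and the default of 90 are assumptions for this sketch.)
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                        # skip the header line
            pct = $5                    # capacity column, e.g. "93%"
            sub(/%$/, "", pct)          # strip the trailing percent sign
            if (pct + 0 >= limit + 0) print $6, pct
        }'
}
```

Running df -P | full_filesystems 95 on a monitored machine shows which filesystems you would expect Cassandrix to complain about at that level.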
Cassandracle
Cassandracle original docs
Cassandracle monitors an Oracle installation. For this monitor, we want to use a more restricted Oracle driver, namely /home/aol30/ora8cass.so, which was created when you installed the drivers. If it doesn't exist, go through the ArsDigita Oracle driver installation first.
- Grab the tarfile at ArsDigita Download
- untar it into /web/ce
- copy /web/yourservername/tcl/ad-utilities.tcl.preload into /web/ce/tcl/ad-utilities.tcl.preload and /web/yourservername/tcl/00-ad-preload.tcl into /web/ce/tcl/00-ad-preload.tcl. It will create a few error messages in your error log because it doesn't find some of the preload files that are in your server installation, but it doesn't really matter.
- change to the oracle user and load that user's environment:
su - orauser
where orauser is your Oracle user
- Run the following at prompt:
svrmgrl
connect internal
create user cassandracle identified by *password* default tablespace yourtablespace temporary tablespace temp quota unlimited on yourtablespace;
grant connect, resource, dba to cassandracle;
grant select on V_$SQLTEXT to public;
exit
- run the following at prompt:
svrmgrl
connect internal
grant select on V_$SQLTEXT to public;
exit
- run the procedures in /web/ce/doc/helper-procedures.sql
sqlplus orauser/orapassword < /web/ce/doc/helper-procedures.sql
- switch from the oracle user to nsadmin
su - nsadmin
- make the ini file.
cp /home/aol30/yourserver.ini /home/aol30/ce.ini
- edit ce.ini
- change ora8.so to ora8cass.so
- correct the directories and the log file, and change the Oracle user to cassandracle
- delete the auxconfigdir line in [ns/parameters]
- change the Pageroot to /web/ce in [ns/server/ce]
- add the line Port=1999 in [ns/server/ce/module/nssock]
- make the server directory under /home/aol30/servers/ce. (copy yourserver's server directory)
- insert into /etc/inittab
nsce:34:respawn:/home/aol30/bin/nsd-oracle -ic /home/aol30/ce.ini
- type init q to load it, then go to http://yourserver:1999
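Two of the ce.ini edits listed above are purely mechanical: switching the driver from ora8.so to ora8cass.so and deleting the auxconfigdir line. A minimal sketch of those two edits follows. It assumes the auxconfigdir key sits at the start of its line (in either capitalization), and ce_ini_filter is a hypothetical helper name. The directories, log file, Oracle user, Pageroot and port still have to be edited by hand.

```shell
# Filter an AOLserver ini file from stdin to stdout:
#   1. replace the driver name ora8.so with ora8cass.so
#   2. drop the auxconfigdir line (matched as auxconfigdir or AuxConfigDir)
# (ce_ini_filter is a hypothetical name used only for this sketch.)
ce_ini_filter() {
    sed -e 's/ora8\.so/ora8cass.so/g' \
        -e '/^[Aa]ux[Cc]onfig[Dd]ir[ ]*=/d'
}

# Example:
#   ce_ini_filter < /home/aol30/yourserver.ini > /home/aol30/ce.ini
```

Because the pattern ora8\.so does not match ora8cass.so, running the filter a second time leaves the file unchanged.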
MTA (Mail Transport Agent) Monitor
MTA original docs
This monitors a group of mail transport agents administered by one or more administrators. It connects to each SMTP port every five minutes and also tries to send a small test mail every 15 minutes. If either fails, it sends email to the appropriate addresses.
- Grab the tarfile at ArsDigita Download
- make a directory /web/mmon (accessible by nsadmin) and untar the file in that directory (the tarfile creates www, parameters, and tcl directories)
- Create the AOLserver install:
- feed the data model into Oracle. You can either run
sqlplus orauser/orapassword < /web/mmon/www/doc/sql/mmon.sql
or visit http://yourserver:8888/mmon/data-model.tcl (keep your eyes on the error log to make sure it worked). If you have problems, then you can run http://yourserver:8888/mmon/drop-everything-user-with-care.tcl
- Edit bouncer.pl and receiver.pl in /web/mmon/www/mmon/. Fix the server's hostname or IP address, and make sure the path to the Perl executable (/usr/bin or /usr/local/bin) is correct.
- Within your RedHat install, you should have sendmail.
- Create a special E-mail account (usually an alias) on every monitored server which calls bouncer.pl. You'll enter this alias when you set up a server to be monitored. The default name is mmon_bouncer.
- Create a special E-mail account on the monitoring server. That account should be configured to spawn receiver.pl. For example, if you are using qmail you can create a UNIX user and put in that user's home directory a file called .qmail (note the leading dot) with a single line:
| /path-to/receiver.pl
With Sendmail you would add a line to /etc/aliases:
mmon-receiver: |/path-to/receiver.pl
- copy /web/yourservername/tcl/ad-utilities.tcl.preload into /web/mmon/tcl/ad-utilities.tcl.preload and /web/yourservername/tcl/00-ad-preload.tcl into /web/mmon/tcl/00-ad-preload.tcl. It will create a few error messages in your error log because it doesn't find some of the preload files that are in your server installation, but it doesn't really matter.
- edit /web/mmon/parameters/mmon.ini. For testing, you may want to set MinNotificationInterval, MinutesBetweenSMTPChecks and BounceTimeout to lower values to make sure that they work.
- edit /web/mmon/tcl/mmon-defs.tcl *****WHAT DO WE CHANGE HERE???******
- edit /web/mmon/parameters/mmon.ini: change the emails
- Restart AOLserver
- Visit http://yourserver:8888/mmon/server-add.tcl and add the required servers you'd like to have monitored
- Watch the server log to see whether the MTA Monitor wakes up at the specified interval. Reload http://yourserver:8888/mmon/controlpanel.tcl frequently to see what's going on
- Simulate a problem with an MTA (e.g. change the SMTP port to a nonstandard value, or change the bouncer E-mail address to your own address) and make sure it gets reported
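One of the steps above asks you to make sure bouncer.pl and receiver.pl point at the right Perl executable. The sketch below reads the first line of a script and reports whether the interpreter named there exists and is executable. check_shebang is a hypothetical helper name, and the sketch assumes the scripts name their interpreter on a conventional #! first line.

```shell
# Print the interpreter named on a script's "#!" line.
# Returns 0 if that interpreter exists and is executable, 1 if it does not,
# and 2 if the file cannot be read or has no "#!" line.
# (check_shebang is a hypothetical name used only for this sketch.)
check_shebang() {
    first=$(head -n 1 "$1") || return 2
    case $first in
        '#!'*) ;;
        *) echo "no #! line in $1" >&2; return 2 ;;
    esac
    interp=${first#??}            # drop the leading "#!"
    interp=${interp# }            # drop one optional space after "#!"
    interp=${interp%% *}          # keep only the path, not its arguments
    echo "$interp"
    [ -x "$interp" ]
}
```

For example, check_shebang /web/mmon/www/mmon/receiver.pl prints the Perl path the script expects; a non-zero exit status means the first line of the script needs editing.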
Appendix
Restarting AOLserver
We have a script, /home/aol30/bin/restart-aolserver, which Keepalive and some other tools need in order to run.
#!/usr/local/bin/perl
## Restarts an AOLserver. Takes as its only argument the name of the server to kill.
## This is a perl script because it needs to run setuid root,
## and perl has fewer security gotchas than most shells.
$ENV{'PATH'} = '/sbin:/bin';
# uncomment this stuff if you're at an installation where a server
# takes a long time to restart or keeps important state
# if (scalar(@ARGV) == 0) {
#     die "Don't run this without any arguments!";
# }
$server = shift;
$< = $>; # set realuid to effective uid (root)
sub getpids {
    ## get the PIDs of all nsd processes started with $server.ini
    my $ps_output = `/usr/bin/ps -ef`;
    my @pids;
    foreach (split(/\n/, $ps_output)) {
        ## \Q...\E and \. keep the server name and the dot literal
        next unless /^\s*\S+\s+(\d+).*nsd.*\Q$server\E\.ini/;
        push(@pids, $1);
    }
    @pids;
}
@pids = &getpids;
print "Killing ", join(" ", @pids), "\n";
kill 'KILL', @pids;
Make sure that you have the correct version of ps in the line that says:
my $ps_output = `/usr/bin/ps -ef`;
You might want to make it `/usr/ps -ef`
or wherever your ps is. If you are confused, you can find out by typing which ps
at the prompt. Also, on some systems, it might be better to use -ewf option rather than -ef to make sure that ps doesn't truncate the text.
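Before trusting the script with kill, it can be reassuring to see which PIDs it would select on your system. The sketch below applies roughly the same test as the Perl script (a line containing nsd followed later by yourservername.ini, PID taken from the second column) but only prints the PIDs. nsd_pids is a hypothetical helper name, and the sketch assumes the PID is the second column of your ps output, as it is with ps -ef.

```shell
# Read `ps -ef` style output on stdin and print the PID (second column) of
# every line where "nsd" is followed, later on the line, by "<server>.ini".
# Nothing is killed; this only shows what would be selected.
# (nsd_pids is a hypothetical name used only for this sketch.)
nsd_pids() {
    awk -v server="$1" '{
        n = index($0, "nsd")                      # first "nsd" on the line
        if (n && index(substr($0, n + 3), server ".ini")) print $2
    }'
}
```

Running ps -ef | nsd_pids keepalive (or ps -ewf, if your ps truncates long command lines) lists the processes that restart-aolserver keepalive would kill.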