Web Robot Detection
part of the ArsDigita Community System
by Michael Yoon
The Big Picture
Many of the pages on an ACS-based website are hidden from robots
(a.k.a. search engines) by virtue of the fact that login is required to
access them. A generic way to expose login-required content to robots is
to redirect all requests from robots to a special URL that is designed
to give the robot what appears, at least, to be a set of linked files.
You might want to use this software for situations where public (not
password-protected) pages aren't getting indexed by a specific robot.
Many robots won't visit pages that look like CGI scripts, e.g., with
question marks and form vars (this is discussed in Chapter 7 of
Philip and Alex's Guide to Web Publishing).
The Medium-sized Picture
In order for this to work, we need a way to distinguish robots from
human beings. Fortunately, the Web Robots Database maintains a list of
active robots, which it kindly publishes as a text file. By loading
this list into the database, we can implement the following algorithm:
- Check the User-Agent of each HTTP request against those of known
  robots (which are stored in the robot_useragent column of the robots
  table).
- If there is a match, redirect to the special URL.
- The special URL can be either a static page or a dynamic script that
  dumps lots of juicy text from the database, for the robot's indexing
  pleasure.
This algorithm is implemented by a postauth filter proc, ad_robot_filter.
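The filter's logic can be sketched as follows. This is an illustration
of the algorithm above, not the module's actual code: robot_useragent_p
and the RedirectURL parameter name are assumptions, while ns_conn,
ns_set, ad_parameter, and ad_returnredirect are standard
AOLserver/ACS calls.

```tcl
# Sketch of a postauth filter that bounces known robots to the
# special URL. robot_useragent_p is a hypothetical helper that
# matches the User-Agent header against the robot_useragent
# column of the robots table.
proc ad_robot_filter {conn args why} {
    set useragent [ns_set iget [ns_conn headers] "User-Agent"]
    if { [robot_useragent_p $useragent] } {
        # Send the robot to the configured destination
        # (RedirectURL is an assumed parameter name).
        ad_returnredirect [ad_parameter RedirectURL robot-detection "/robot-heaven/"]
        return filter_return
    }
    return filter_ok
}
```

In a setup like this, the module would register the filter once per
configured URL pattern at startup, e.g.
ns_register_filter postauth GET /members/* ad_robot_filter.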
(Note: For now, we are only storing the minimum number of
fields needed to detect robots, so many of the columns in the
robots table will be empty. Later, if the need presents
itself, we can enhance the code to parse out and store all fields.)
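Each record in the Web Robots DB text file is a block of "field: value"
lines. A minimal Tcl sketch of pulling out just the user-agent values
might look like this; the robot-useragent field name is inferred from
the robot_useragent column above, so treat the format details as
assumptions:

```tcl
# Sketch: collect the user-agent strings from the Web Robots DB
# text format, assuming one "field: value" pair per line.
proc parse_robot_useragents {text} {
    set useragents [list]
    foreach line [split $text "\n"] {
        # keep only lines of the form "robot-useragent: <value>"
        if { [regexp {^robot-useragent:\s*(.+)$} $line -> ua] } {
            lappend useragents [string trim $ua]
        }
    }
    return $useragents
}
```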
The module reads four configuration parameters: the URL of the Web
Robots DB text file; which URLs ad_robot_filter should check
(commented out by default; uncomment to turn the system on); the URL
where robots should be sent; and how frequently (in days) the robots
table should be refreshed from the Web Robots DB.
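Put together, the relevant ad.ini section might look like the sketch
below. RefreshIntervalDay and FilterPattern are parameter names used
elsewhere in this document; the section path, the WebRobotsDB and
RedirectURL key names, and all values are assumptions for illustration:

```ini
[ns/server/yourservername/acs/robot-detection]
; the URL of the Web Robots DB text file (key name assumed)
WebRobotsDB=http://webrobots.example/active/all.txt
; which URLs ad_robot_filter should check (uncomment to turn the system on)
;FilterPattern=/members/*
; the URL where robots should be sent (key name assumed)
RedirectURL=/robot-heaven/
; how frequently (in days) the robots table should be refreshed
RefreshIntervalDay=30
```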
Notes for the Site Administrator
Though admin pages exist for this module, there should be no need to
use them in normal operation. This is because the ACS automatically
refreshes the contents of the robots table at startup, whenever the
table is empty or its data is older than the number of days specified
by the RefreshIntervalDay configuration parameter. If no
FilterPatterns are specified in the configuration, then the robot
detection filter will not be installed.
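The startup check described above amounts to something like the
following sketch; every proc name here other than ad_parameter is
hypothetical:

```tcl
# Sketch of the startup refresh: reload the robots table when it is
# empty or stale. robots_table_age_days and robots_refresh_from_db
# are hypothetical helper names, not the module's actual procs.
proc robots_refresh_if_needed {} {
    set max_age [ad_parameter RefreshIntervalDay robot-detection 30]
    set age [robots_table_age_days]   ;# e.g. -1 when the table is empty
    if { $age < 0 || $age > $max_age } {
        robots_refresh_from_db
    }
}
```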
To turn robot detection on:
- build a non-password-protected site starting at /robot-heaven/
  (that's the default destination), using whatever techniques are
  necessary to create a pseudo-static HTML file appearance
- specify the directories and file types you want filtered and bounced
  into /robot-heaven/ (in the ad.ini file)
- restart AOLserver
- visit the /admin/robot-detection/ admin page to see whether your
  configs took effect
- view your server error log to make sure that the filters are getting
  installed
See the ACS Acceptance Test.