This chapter is about ways to group all the code for a module, to record the existence of documentation for that module, to publish APIs to other parts of the system, and methods for storing configuration parameters.
bidcolumns to an existing content table makes more sense than constructing new tables and that adding a lot of IF statements to the scripts that present discussion questions and answers makes more sense than writing new scripts.
If the online community is used to support a group of university students and teachers, additional specialized modules would be added, e.g., for recording which courses are being taught by whom and when, which students are registered in which courses, what handouts are associated with each class, what assignments are due and by when, and what grades have been assigned and by which teachers.
Recall that the software behind an Internet service is frequently updated as the community grows and new ideas are developed. Frequently updated software is going to have bugs, which means that the system will be frequently debugged, oftentimes at 2:00 am and usually by a programmer other than the one who wrote the software. It is thus important to publish and abide by conventions that make it easy for a new programmer to figure out where the relevant source code files are. It might take only fifteen minutes to figure out what is wrong and patch the system. But if it takes three hours to find the source code files to begin with, what would have been an insignificant bug becomes a half-day project.
Let's walk through an example of how the software is arranged on the
photo.net service. The server is configured to operate multiple Internet
services. Each one is located at
which means that all the directories associated with photo.net are
/web/photonet/. The page root for the site is
/web/photonet/www/. The Web server is configured to look
for "library" procedures (shared by multiple pages) in
/web/photonet/tcl/, a name derived from the fact that
photo.net is run on AOLserver,
whose default extension language is Tcl.
RDBMS table, index, and stored procedure definitions for a module are
stored in a single file in the
(directory names in this chapter are relative to the Web server page root
unless specified as absolute). The name for this file is the module
name followed by a
.sql extension, e.g.,
chat.sql for the chat module. Shared procedures for all
modules are stored in the single library directory
/web/photonet/tcl/, with each file named
Scripts that generate individual pages are parked at the following
/module-name/ for the user pages;
/module-name/admin/ for the moderator pages, e.g., where
a user with moderator privileges would go to delete a posting;
/admin/module-name/ for the site administrator pages,
e.g., where the service operator would go to enable or disable a
service, delegate moderation authority to another user, etc.
A high-level document explaining each module is stored in
/doc/module-name.html and linked from the index page in
/doc/. This document is intended as a starting point for
programmers who are considering using the module or extending a
feature of the module. The document has the following structure:
As we'll see in the "Scaling Gracefully" chapter, there are some performance advantages to be had in splitting off the presentation layer of an application into a set of separate physical computers. Thus our page scripts will most definitely reside outside of the RDBMS. This gives us the opportunity to write additional software that will run within or close to the Web server program, typically in the same computer language that is used for page scripting, in the form of shared procedures. In the case of a PHP script, for example, a shared procedure could be an include file. In the case of a site where individual pages are scripted in Java or C#, a shared procedure might be some classes and methods used by multiple pages.
How do you choose between using shared procedures and stored procedures? Start by thinking about the multiple applications that may connect to the same database. For example, there could be a public Web server, a nightly program that pulls out all new information for analysis, a maintenance tool for administrators built on top of Microsoft Excel or Access, etc.
If you think that a piece of code might be useful to those other systems that connect to the same data model, put it in the database as a stored procedure. If you are sure that a piece of code is only useful for the particular Web application that you're building, keep it in the Web server as a shared procedure.
"As we enter the 21st century we find that rifle marksmanship has been largely lost in the military establishments of the world. The notion that technology can supplant incompetence is upon us in all sorts of endeavors, including that of shooting."Given a system with 1000 procedures and no documentation, the typical manager will lay down an edict to the programmers: you must write a "doc string" for every procedure saying what inputs it takes, what outputs it generates, and how it transforms those inputs into outputs. Virtually every programming environment going back to the 1960s has support for this kind of thinking. The fancier "doc string" systems will even parse through directories of source code, extract the doc strings, and print a nice-looking manual of 1000 doc strings.
-- Jeff Cooper in The Art of the Rifle (1997; Paladin Press)
How useful are doc strings? Useful, but not sufficient. The programmer new to a system won't have any idea which of the 1000 procedures and corresponding doc strings are most important. The new programmer won't have any idea why these procedures were built, what problem they solve, and whether the whole system has been deprecated in favor of newer software from another source. Certainly the 1000 doc strings aren't going to convince any programmers to adopt a piece of software. It is much more important to present clear English prose that demonstrates the quality of your thinking and design work in attacking a real problem. The prose does not have to be more than a few pages long, but it needs to be carefully crafted.
Suppose that you've built your Web application in the simplest and most direct manner. For each URL there is a corresponding script, which contains SQL statements, some procedural code in the scripting language (IF statements, basically), and static strings of HTML that will be combined with the values returned from the database to form the completed page. If you break down what is inside a Visual Basic Active Server Page or a Java Server Page or a Perl CGI script, you always find these three items: SQL, IF statements, HTML.
Development of an application with this style of programming is easy. You can see all the relevant code for a page in one text editor buffer. Maintenance is also straightforward. If a user sends in a bug report saying "There is a spelling error on http://www.yourcommunity.org/foo/bar" you know that you need only look in one file in the file system (/foo/bar.asp or /foo/bar.jsp or /foo/bar.pl or whatever) and you are guaranteed to find the source of the user's problem. This goes for SQL and procedural programming errors as well.
What if people want site-wide changes to fonts, colors, headers and footers? This could be easy or hard depending on how you've crafted the system. Suppose that default colors are read from a configuration parameter system and headers, footers, and per-page navigation aids are generated by the page script calling shared procedures. In this happy circumstance, making site-wide changes might take only a few minutes.
What if people want to change the wording of some annotation in the static HTML for a page? Or make a particular headline on one page larger? Or add a bit of white space in one place on one page? This will require a programmer because the static HTML strings associated with that page are embedded in a file that contains SQL and procedural language code. You don't want someone to bring a section of the service down because of a botched attempt to fix a typo or add a hint.
Suppose that File 1 is named
index.pl and is a Perl
script. By convention, File 2 will be named
index.template. In preparing a template, a designer needs
to know (a) the names of the variables being set in index.pl, (b) that
one references a variable from the template with a dollar sign, e.g.,
$standard_navbar, and (c) that to send an actual dollar
sign or at-sign character to the user it should be escaped with a
backslash. The merging of the template and local variables
established in index.pl can be accomplished with a single call to
eval procedure, which performs standard
Perl string interpolation, i.e., replacing
$foo with the
value of the variable
/web/photonet/www/and templates underneath
/web/photonet/templates/. A script at
/e-commerce/checkout.tclfinishes by calling the shared procedure
return_template. This procedure first invokes the Web server API to find out what URI is being served. A configuration parameter specifies the start of the templates tree.
return_templateuses the URL plus the template tree root to probe in the file system for a template to evaluate. If found, the template, in AOLserver ADP format (same syntax as Microsoft ASP), is evaluated in the context of
return_template's caller, which means that local variables set in the script will be available to the ADP file.
The "medium hammer" approach keeps programmers and designers completely separated from a file system permissioning point of view. It also has the advantage that the shared procedure called at the end of every script can do some poking around. Is this a user who prefers text-only pages? If so, is there a text-only template available? Is this a user who prefers a language other than the site's default? If so, is there a template available in which the annotation is in the user's preferred language?
register.tclwould look for
master.templatefile in the same directory as the script; if found, the content rendered by the page script and its corresponding template would be substituted for the
<slave>tag in the master template and the result of evaluating the master template returned to the user
This sounds inefficient due to the large number of file system probes. However, once a system is in production, it is easy for the Web server to cache, per-URL, the results of the file system investigation. In fact, the Web server could cache all of the templates in its virtual memory for maximum speed. The reason that one wouldn't do this during development is that it would make debugging difficult. Every time you changed a template you'd have to restart the Web server or clear the cache in order to view the results of the change.
Given that all site content is stored in relational database tables, the most obvious way to start writing the user contributions page script is by looking at the SQL data models for each individual module. Then we can write a program that queries a few dozen tables to find all contributions by a particular user.
A drawback to this approach is that we now have code that may break if we change a module's data model, yet this code is not within that module's subdirectory, and this code is probably being authored by a programmer other than the one maintaining the individual module.
Let's consider a different application: email alerts. Suppose that your community offers a discussion forum and a classified ad system, coded as separate modules. A user wishes to get a daily summary of activity in both areas. Each module could offer a completely separate alerts mechanism. However, this would mean that the user would get two email messages every night when a single combined email was desired. If we build a combined email alert system, however, we have the same problem as with the user history page: shared code that depends on the data models of individual modules.
Finally, let's look at the site administrator's job. The site administrator is probably a busy volunteer. He or she does not want to waste twenty mouse clicks to see today's new content. The site administrator ought to be able to view recently contributed content from all modules on a single page. Does that mean we will yet again have a script that depends on every table definition from every module?
Here's a hint at a solution. On the photo.net site each module defines a "new stuff" procedure, which takes the following arguments:
since_when— the date of the earliest content we're interested in
only_from_new_users_p— a boolean indicating whether or not we want to limit the report to contributions from new users (useful for site administration because new users are the ones who don't understand community standards and norms)
purpose— "admin", "email_summary", or "user"; this controls delivery of unapproved content, inclusion of links to administration options such as approval/disapproval, and the format of the report
Where should you store parameters such as these? Except for the database username and password, an obvious answer would seem to be "in the database." There are a bunch of keys (the parameter names) and a bunch of values (the parameters). This is the very problem for which a database management system is ideal.
At the end of every page script we can query these tables:
-- use Oracle's unique key generator create sequence config_param_seq start with 1; create table config_param_keys ( config_param_key_id integer primary key, key_name varchar(4000) not null, param_comment varchar(4000) ); -- we store the values in a separate table because there might -- be more than one for a given key create table config_param_values ( config_param_key_id not null references config_param_keys, value_index integer default 1 not null, param_value varchar(4000) not null ); -- we use the Oracle operator "nextval" to get the next -- value from the sequence generator insert into config_param_keys values (config_param_seq.nextval, 'view_source_link_p', 'damn 6.171 instructor is making me do this'); -- we use the Oracle operator "currval" to get the last -- value from the sequence generator (so that rows inserted in this transaction -- will all have the same ID) insert into config_param_values values (config_param_seq.currval, 1, 't'); commit; insert into config_param_keys values (config_param_seq.nextval, 'redirect', 'dropping the /wtr/ directory'); insert into config_param_values values (config_param_seq.currval, 1, '/wtr/thebook/'); insert into config_param_values values (config_param_seq.currval, 2, '/panda/'); commit;
If the script gets a row with "t" back, it includes a "View Source" link at the bottom of the page. If not, no link.
select cpv.param_value from config_param_keys cpk, config_param_values cpv where cpk.config_param_key_id = cpv.config_param_key_id and key_name = 'view_source_link_p'
Recording a redirect required the storage of two rows in the
config_param_values table, one for the "from" and one for
the "to" URL. When a request comes in, the Web server will want to
query to figure out if a redirect exists:
select cpk.config_param_key_id from config_param_keys cpk, config_param_values cpv where cpk.config_param_key_id = cpv.config_param_key_id and key_name = 'redirect' and value_index = 1 and param_value = :requested_url
:requested_urlis a bind variable containing the URL requested by the currently-connected Web client. Note that this query tells us only that such a redirect exists; it does not give us the destination URL, which is stored in a separate row of
config_param_values. Believe it or not, the conventional thing to do here is a three-way join, including a self-join of
select cpv2.param_value from config_param_keys cpk, config_param_values cpv1, config_param_values cpv2 where cpk.config_param_key_id = cpv1.config_param_key_id and cpk.config_param_key_id = cpv2.config_param_key_id and cpk.key_name = 'redirect' and cpv1.value_index = 1 and cpv1.param_value = :requested_url and cpv2.value_index = 2 -- that was pretty ugly; maybe we can encapsulate it in a view create view redirects as select cpv1.param_value as from_url, cpv2.param_value as to_url from config_param_keys cpk, config_param_values cpv1, config_param_values cpv2 where cpk.config_param_key_id = cpv1.config_param_key_id and cpk.config_param_key_id = cpv2.config_param_key_id and cpk.key_name = 'redirect' and cpv1.value_index = 1 and cpv2.value_index = 2 -- a couple of Oracle SQL*Plus formatting commands column from_url format a25 column to_url format a30 -- let's look at our virtual table now select * from redirects; FROM_URL TO_URL ------------------------- ------------------------------ /wtr/thebook/ /panda/
N-way joins notwithstanding, how tasteful is this approach to storing parameters? The surface answer is "extremely tasteful." All of our information is in the RDBMS where it belongs. There are no magic numbers in the code. The parameters are amenable to editing from admin pages that have the same form as all the other pages on the site: SQL queries and SQL updates. After a little more time spent with this problem, however, one asks "Why are we querying the RDBMS one million times per day for information that changes once per year?"
Questions of taste aside, an extra five to ten RDBMS queries per request is a significant burden on the database server, which is the most difficult part of an Internet application to distribute across multiple physical computers (see the "Scaling" chapter) and therefore the most expensive layer in which to expand capacity.
A good rule of thumb is that Web scripts shouldn't be querying the RDBMS to figure out what to do; they should query the RDBMS only for content and user data.
For reasonable performance, configuration parameters should be accessible to Web scripts from the Web server's virtual memory. Implementing such a scheme with a threaded Web server is pretty straightforward because all the code is executing within one virtual memory space:
If you expect to have a lot of configuration parameters, it might be
best to add a "section" column to the
table and query by section and key. Thus, for example, you can have a
parameter called "bug_report_email" in both the "discussion" and
"user_registration" sections. The key to the hash table then becomes
a composite of the section name and key name.
For example, if you place the following in
c:\Inetpub\wwwroot\Web.config (assuming default IIS installation)
you will be able to access publisherEmail in a VB .aspx page as follows
<configuration> <appSettings> <add key="publisherEmail" value="firstname.lastname@example.org" /> </appSettings> </configuration>
By default, configuration settings apply to a directory and all its subdirectories. Also by default, these settings can be overridden by settings in Web.config files in the subdirectories. More elaborate rules for scoping and override behavior can be established using the <location> tag.
<% Dim publisherEmail as String publisherEmail = ConfigurationSettings.AppSettings( "publisherEmail" ) %> <html> <body> ... For further information please contact us at <%= publisherEmail %> ... </body> </html>
Specify Parameter tags within the Context specification for your application in conf/server.xml. Example:
<Context path="/myapp" docBase="myapp" debug="0" reloadable="true" crossContext="true"> <Parameter name="companyName" value="My Company, Inc." override="false"/> </Context>
You can also specify the parameter in the WEB-INF/web.xml file for your application:
The "override" attribute in the first example specifies that you do not want this value to be overridden by a context-param tag in the web.xml file. The default value is "true" (allow overrides).
<context-param> <param-name>companyName</param-name> <param-value>My Company, Inc.</param-value> </context-param>
To retrieve parameters from a servlet or JSP, you can call:
/doc/directory on your team server. Create an index page in this directory that links to a development standards document (
/doc/development-standardswould be a reasonable URL but you can use whatever you like so long as it is clearly linked from
In this development standards document, cover at least the following issues:
/foo/bar, /foo/bar-1, /foo/bar-2, etc.
/doc/directory, perhaps at
/doc/intermodule-API, linked from the doc index page. Your strategy must be able to handle at least the following cases:
The most obvious solution would seem to be disallowing all HTML tags. Any uploaded text is scanned for the characters < and > and, if those are present, the posting is rejected with an explanation. This wouldn't work out that well in a site for mathematicians! Maybe they need to use greater-than and less-than signs in their postings.
The beginning of a workable solution is a procedure, perhaps named
something such as
quoteHTML that takes a user-uploaded
text string and performs the following conversions:
That works great for fields such as
street_address, subject summary
lines, etc., where there is no value to having an HTML tag. For some
longer documents obtained from users, however, it might be nice to
enable them to use a restricted set of HTML tags such as B, I, EM, P,
BR, UL, LI, etc. If you're going to store HTML in the database once
and serve it back out thousands of times per day, it is better to check
for legal tags at upload time. The problem with checking for
disallowed tags such as SCRIPT, DIV, and FONT is that HTML keeps
getting extended in de jure and de facto ways. Unless you want the
responsibility of keeping current with all of the ways in which new
HTML tags can make browsers behave, it may be better to check for
approved tags. Either way, you'll want the allowed or disallowed tags
list to be kept in an easy-to-modify configuration file. Further, you
probably want to perform a bit of validation on the use of allowed
tags such as B or I. A user who makes a mistake and forgets to close
one of these tags might render 100 comments underneath in an unusual