Chapter 10: Sites That Are Really Programs

by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing

Revised July 2003

The classic (circa 1993) Web site comprises static .html files in a Unix or Windows file system. This kind of site is effective for one-way non-collaborative publishing of material that seldom changes.

You needn't turn your Web site into a program just because the body of material that you are publishing is changing. Sites such as http://dir.yahoo.com, for example, are sets of static files that are periodically generated by programs grinding through a dynamic database. With this sort of arrangement, the site inevitably lags behind the database but you can handle millions of requests a day without a major investment in computer hardware, custom software, or thought.
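
As a rough illustration of the "static files generated from a database" arrangement, here is a minimal Python sketch. The table name, column names, and sample rows are hypothetical, and an in-memory SQLite database stands in for whatever dynamic database a real site would grind through; a scheduled job would rerun the generator periodically.

```python
import sqlite3
import tempfile
from html import escape
from pathlib import Path


def generate_static_pages(db, out_dir):
    """Write one static .html file per category row; return the paths written."""
    written = []
    rows = db.execute("SELECT name, description FROM categories ORDER BY name")
    for name, description in rows:
        page = out_dir / f"{name}.html"
        page.write_text(
            f"<html><head><title>{escape(name)}</title></head>\n"
            f"<body><h2>{escape(name)}</h2>\n"
            f"<p>{escape(description)}</p></body></html>\n"
        )
        written.append(page)
    return written


# Hypothetical sample data standing in for the dynamic database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE categories (name TEXT, description TEXT)")
db.executemany(
    "INSERT INTO categories VALUES (?, ?)",
    [("photography", "Cameras & lenses"), ("travel", "Places to go")],
)

out_dir = Path(tempfile.mkdtemp())
pages = generate_static_pages(db, out_dir)
print([page.name for page in pages])
```

The Web server then only ever serves the generated files, which is why the site lags behind the database by however long it has been since the last run.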

If you want to make a collaborative site, however, at least some of your Web pages will have to be computer programs. Pages that process user submissions have to add user-supplied data to your Web server's disk. Pages that display user submissions have to look through a database on your server before delivering the relevant contributions.

In older times, if you wanted to publish completely static, non-collaborative material, at least one portion of your site would require server-side programming: the search engine. To provide a full-text search over the material on a site, the server would have to take a query string from the user, compare it to the files on the disk, and then return a page of links to relevant documents. Nowadays most people who wanted to do this would instead build a form targeting Google with hidden form variables that will restrict the search to the originating site's domain (see http://www.google.com/searchcode.html to set this up).
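
One way to picture "restricting the search to the originating site's domain" is to build the search URL yourself. The sketch below is an assumption-laden stand-in: it uses the site: query operator rather than the hidden form variables mentioned above, and the function name is made up for illustration.

```python
from urllib.parse import urlencode


def site_search_url(query, domain):
    """Return a Google search URL restricted to one domain via the site: operator."""
    return "https://www.google.com/search?" + urlencode({"q": f"{query} site:{domain}"})


print(site_search_url("focal length", "photo.net"))
```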

This chapter discusses the options available to Web publishers who need to write program-backed pages. Here are the steps:

Decide whether you're building a dynamic document or a program with a Web interface.
Choose a computer language.
Choose a program invocation mechanism.
Choose a Web server program to support the first three choices.

Step 1: Document or Program?

A document is typically edited or updated over a period of days. Changes in one portion of a document don't have far-reaching implications for other portions. A computer program is typically versioned, debugged, and tested over a period of months.

Every interesting Web site has some characteristics of both a document and a computer program. There is thus no correct answer to the question "Is your site a hypertext document with bits of computation or a computer program with bits of static text?" However, the tools that make it easy for a team of experts to develop a computer program will get in the way if your site is fundamentally a document. Conversely, the tools that make it convenient to edit a document can lead to sloppy and error-filled computer programs.

A server-side programming system that takes the document model to its logical extreme is Microsoft Active Server Pages (ASP). A vanilla HTML file is a legal ASP document. If you want to add some computation, you weave in little computer language fragments, surrounded by <% ... %>. If you want to fix a typo or a programming bug, you edit the .asp file (or the .adp file, in the comparable AOLserver Dynamic Pages system) and hit reload in your Web browser to see the new version. Almost always, the connection is direct and immediate between the URL where the problem was observed and the file on the server that you must edit. You don't have to understand much of the document's structure to fix a bug.
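
To make the "HTML by default, code by exception" idea concrete, here is a toy Python renderer. It is only a sketch of the document model, not how ASP itself is implemented: it handles just the <%= expression %> form and evaluates each expression with Python's eval, which is reasonable only when the page author is trusted.

```python
import re


def render(template, variables):
    """Replace each <%= expression %> with the string value of the expression."""

    def evaluate(match):
        # eval is acceptable here only because the template author is trusted.
        return str(eval(match.group(1), {}, variables))

    return re.sub(r"<%=\s*(.*?)\s*%>", evaluate, template)


page = "<h3>Hello</h3>\n<p>Two plus two is <%= 2 + 2 %>, <%= name %>.</p>"
print(render(page, {"name": "reader"}))
```

A page with no code fragments passes through untouched, which is the sense in which a vanilla HTML file is already a legal document in this model.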

At the other end of the document/program spectrum are various "application servers" that require you to program in C or Java. HTML text is inevitably buried inside these programs. Fixing a typo requires editing the program, compiling the program, and reloading the compiled code into the Web or application server. If there is a problem with a URL, fixing it might require reading and editing dozens of program files and understanding most of the program's overall structure.

With the right tools and programmer resources, you can build a jewel-like software system to sit behind a Web site. But ask yourself whether the entire service isn't likely to be redesigned after six months, and if, realistically, your site isn't going to be thrown together hastily by overworked programmers. If so, perhaps it will be best to look for the tightest development cycle.

Step 2: Choose a Computer Language

People usually choose a computer language according to how well it supports management of complexity. Much of the complexity in a Web site, however, is in the number of URLs and how they interact (i.e., how many form-variable arguments get passed from one page to another). So the system structure is very similar regardless of the computer language employed.

Consider these aspects:

support for complex data types
safety
development time
library

"PHP is a pet peeve of mine. They spent ungodly hours inflicting their own new scripting language on the world that is almost exactly like Perl."
-- a programmer friend who has suffered through 10 years of gratuitous changes in open-source Web development tools

You would think that picking a Web site development language would be trivial. Obviously the best languages are safe and incorporate powerful object systems. So let's do everything in Common Lisp or Java. Common Lisp can run interpreted as well as compiled, which makes it a more efficient language for developers. So Common Lisp should be the obvious winner of the Web server language wars. Yet nobody uses Common Lisp for server-side scripting. Is that because Java-the-hype-king has crushed it? No. In fact, to a first approximation, nobody uses Java for server-side scripting. Almost everyone is using simple interpreted languages such as Visual Basic, PHP, Perl, or Tcl.

How could a lame string-oriented scripting language possibly compete in power with systems programming languages? Well, guess what? The only data type that you can write to a Web browser is a string. And all the information from the relational database management system on which you are relying comes back to the Web server program as strings. So maybe it doesn't matter whether your scripting language has an enfeebled type system.

Are these languages really the best? Computer scientists can't believe that a scripting language could be as good as Lisp and better than Java for developing Internet applications. But it turns out to be almost true. A scripting language is better than Java because a scripting language doesn't have to be compiled. A scripting language can be better than Lisp because string manipulation is simpler. For example, in Tcl or Perl


"posted by $email on $posting_date."

will generate a string from the fragments of static ASCII above plus the contents of the variables $email and $posting_date. These were presumably recently pulled from a relational database. The result might look something like


"posted by philg@mit.edu on February 15, 1998."

In Common Lisp, you'd have


(concatenate 'string "posted by " email " on " posting-date ".")

which uses a fabulously general mechanism for concatenating sequences. concatenate can work on sequences of ASCII characters (strings) or sequences of TCP packets or sequences of three-dimensional arrays or sequences of double-precision complex numbers. Sequences can either be lists (fast to modify) or vectors (fast to retrieve). This kind of flexibility, which Java apes, is wonderful except that Web programmers are concatenating strings 99.99 percent of the time and the scripting languages' syntactic shortcuts make code easier to read and more reliable.
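
The same contrast can be seen in a single language. This short Python sketch (Python is used here only for illustration; the variable values are the ones from the example above) builds the string once with interpolation and once with a general-purpose sequence join.

```python
email = "philg@mit.edu"
posting_date = "February 15, 1998"

# Interpolation: the template reads like the finished sentence.
interpolated = f"posted by {email} on {posting_date}."

# General sequence concatenation: same result, more punctuation to get wrong.
concatenated = "".join(["posted by ", email, " on ", posting_date, "."])

print(interpolated)
```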

If your source of persistent storage were an object database, which can directly represent complex types, a language such as C#, Common Lisp, or Java would be very useful for writing individual Web pages. But in our current world, which is overwhelmingly dominated by the relational database management system and its three types (string, number, and date), these languages add very little power.

Finally, don't forget that even if you're developing individual pages in a scripting language you can write substrate programs in C#, Java, PL/SQL and other more complex languages. If you're using the Microsoft .NET environment, methods of C# classes can be invoked by any VB.NET program. If you want to do sophisticated computation on information that comes from the relational database, typically that can be done by a Java or PL/SQL program running inside Oracle or by a C# program running inside Microsoft SQL Server.

Step 3: Choose a Program Invocation Mechanism

What happens after the user requests a dynamic page? How does the server know that a program needs to be called and what does it have to do to run that program?

The oldest mechanism for program invocation via the Web is the Common Gateway Interface (CGI). The CGI standard is an abstraction barrier that dictates what a program should expect from the Web server, for example, user form input, and how the program must return characters to the Web server program for them to eventually be written back to the Web user. If you write a program with the CGI standard in mind, it will work with any Web server program. You can move your site from Apache to Microsoft Internet Information Services (IIS) and all of your CGI scripts will still work. You can give your programs away to other Web publishers who aren't running the same server program. Of course if you wrote your CGI program in C and compiled it for a Linux box, it might not run so great on a Windows XP machine.

Oops.

We've just discovered why most CGI scripts are written in Perl, PHP, Tcl, or some other interpreted computer language. The systems administrator can install the Perl or Tcl interpreter once and then Web site developers on that machine can easily run any script that they download from another site.

Fixing a bug in an interpreted CGI script is easy. A message shows up in the error log when a user accesses "http://yourserver.nerdu.edu/bboard/subject-lines.pl". If your Web server document root is at /web, then you know to edit the file /web/bboard/subject-lines.pl. After you've found the bug and written the file back to the disk, the next time the page is accessed the new version of the subject-lines Perl script will be interpreted.

For concreteness, let's summarize Unix CGI:

The server stuffs a bunch of information into Unix "environment variables," the name of the host that made the request, for example.
Form variable values get to the script via an environment variable (if a GET) or via standard-in (if a POST).
The server binds standard-out effectively to "the client's screen" so that the CGI script thinks it is writing straight to the user.
The first thing the CGI script must write is a content-type header that tells the client what sort of data to expect (HTML, plain text, a GIF, a JPEG, and so on).

It is all pretty much the same on Windows but the author hasn't personally attempted it. Here's an example CGI program:

#!/usr/contrib/bin/perl
# the first line in a Unix shell script says where to find the
# interpreter. If you don't know where perl lives on your system, type
# "which perl", "type perl", or "whereis perl" at any shell
# and put the result after the #!
print "Content-type: text/html\n\n";
# now we have printed a header (plus two newlines) indicating that the
# document will be HTML; whatever else we write to standard output will
# show up on the user's screen
print "<h3>Hello World</h3>";

This example program will print "Hello World" as a level-3 headline. If you want to get more sophisticated, read some on-line tutorials or CGI Programming with Perl (Birznieks et al; O'Reilly 2000).
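
The four points in the Unix CGI summary above can also be sketched in Python. This is a minimal illustration, not a production script: the form variable "name" is hypothetical, and the environment and standard streams are passed in as arguments so the logic can be exercised without a Web server.

```python
import html
import io
from urllib.parse import parse_qs


def read_form(environ, stdin):
    """Return the form variables as a dict mapping each name to a list of values."""
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        # POST: the body arrives on standard input; CONTENT_LENGTH says how much to read.
        # (Reading text characters is a simplification that holds for ASCII form data.)
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        # GET: the variables arrive in the QUERY_STRING environment variable.
        raw = environ.get("QUERY_STRING", "")
    return parse_qs(raw)


def respond(environ, stdin, stdout):
    """Write a complete CGI response: header, blank line, then the HTML body."""
    form = read_form(environ, stdin)
    name = form.get("name", ["World"])[0]
    stdout.write("Content-type: text/html\n\n")
    stdout.write(f"<h3>Hello {html.escape(name)}</h3>\n")


# Simulate the server with a fake environment and in-memory standard streams.
# A real script would instead call respond(os.environ, sys.stdin, sys.stdout).
fake_stdout = io.StringIO()
respond({"REQUEST_METHOD": "GET", "QUERY_STRING": "name=Alex"}, io.StringIO(""), fake_stdout)
print(fake_stdout.getvalue())
```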

It is that easy to write Perl CGI scripts and get server independence, a tight software development cycle, and ease of distribution to other sites. With that in mind, you might ask how many of the thousands of dynamic Web pages on popular Web sites use this program invocation mechanism. The answer? Maybe a couple.

Reason 1: Computers Do Not Like to Fork 500,000 Times a Day. Every time a CGI script is run, the Web server computer has to start a new process (fork). Think about how long it takes to start a program on a Macintosh or Windows desktop machine. It is a thousand times faster to indent a paragraph in an already-running word processor than it is to fire up that word processor to view even a one-paragraph document. You don't want your users to wait for that and you don't want to have to buy a whole rack of computers to serve a modestly popular site.
Reason 2: The RDBMS Does Not Like to Be Opened and Closed 500,000 Times a Day. Any time that you add collaboration to your site, user data are going into and out of a relational database management system (RDBMS). The RDBMS is implemented as a server that waits for requests for connections from client programs (see the chapter "Interfacing a Relational Database to the Web"). IBM, Microsoft, and Oracle have been working for two decades to make the RDBMS fast once a connection is established. Until the Web came along, however, nobody cared too much about how long it took to open a connection. With the Web came the CGI script, a program that runs for only a fraction of a second. In its brief life, it must establish a connection to the RDBMS, get the results of a query, and then close the connection. Users would get their data in about one-tenth the time if their requests could be handled by an already-connected RDBMS client.

Enter the server application programming interface (API). As discussed in the "So You Want to Run Your Own Server" chapter, most Web server programs allow you to supplement their behavior with extra software that you write. This software will run inside the Web server's process, saving the overhead of forking CGI scripts. Because the Web server program will generally run for at least 24 hours, it becomes the natural candidate to be the RDBMS client.
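
Here is a small Python sketch of why a long-running server process makes a natural RDBMS client. Everything in it is illustrative: the class and table names are made up, and an in-memory SQLite database stands in for a real database server. The point is only that the connection is opened once and then reused by every request handler.

```python
import sqlite3


class LongRunningServer:
    """A stand-in for a Web server process that opens its database connection once."""

    def __init__(self):
        self.connections_opened = 0
        self.db = self._connect()
        self.db.execute("CREATE TABLE comments (body TEXT)")

    def _connect(self):
        # In a CGI world this cost would be paid on every single request.
        self.connections_opened += 1
        return sqlite3.connect(":memory:")

    def handle_post(self, body):
        self.db.execute("INSERT INTO comments VALUES (?)", (body,))

    def handle_get(self):
        rows = self.db.execute("SELECT body FROM comments ORDER BY rowid")
        return [row[0] for row in rows]


server = LongRunningServer()
server.handle_post("first comment")
server.handle_post("second comment")
print(server.handle_get(), server.connections_opened)
```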

All Web server APIs allow you to specify "If the user makes a request for a URL that starts with /foo/bar/ then run Program X". The really good Web server APIs allow you to request program invocation before or after pages are delivered. For example, you ought to be able to say "When the user makes a request for any HTML file, run Program Y first and don't serve the file if Program Y says it is unhappy". Or "After the user has been served any file from the /car-reviews directory, run Program Z" (presumably Program Z performs some kind of logging).
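
The registration model described above can be sketched as a toy dispatcher in Python. This is not any real server's API; the class, method names, and the 403 response are assumptions chosen to mirror the three examples in the text (Program X on a URL prefix, Program Y before service, Program Z after).

```python
class ToyServer:
    """A toy request dispatcher with URL-prefix procedures and before/after filters."""

    def __init__(self):
        self.procs = []    # (url_prefix, procedure)
        self.filters = []  # (when, url_prefix, filter_function)

    def register_proc(self, prefix, proc):
        self.procs.append((prefix, proc))

    def register_filter(self, when, prefix, func):
        self.filters.append((when, prefix, func))

    def handle(self, url):
        # "Before" filters can veto the request by returning a false value.
        for when, prefix, func in self.filters:
            if when == "before" and url.startswith(prefix) and not func(url):
                return "403 Forbidden"
        # The procedure registered with the longest matching prefix wins.
        matches = [(prefix, proc) for prefix, proc in self.procs if url.startswith(prefix)]
        if matches:
            _, proc = max(matches, key=lambda match: len(match[0]))
            response = proc(url)
        else:
            response = "404 Not Found"
        # "After" filters run once the response has been produced (logging, say).
        for when, prefix, func in self.filters:
            if when == "after" and url.startswith(prefix):
                func(url)
        return response


log = []
server = ToyServer()
server.register_proc("/foo/bar/", lambda url: f"Program X ran for {url}")
server.register_proc("/car-reviews/", lambda url: f"review page {url}")
server.register_filter("before", "/private/", lambda url: False)  # Program Y is unhappy
server.register_filter("after", "/car-reviews/", log.append)      # Program Z logs

served = server.handle("/foo/bar/baz.html")
refused = server.handle("/private/diary.html")
review = server.handle("/car-reviews/accord.html")
print(served, refused, review, log)
```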

Step 4: Choose a Web Server Program to Support the First Three Choices

Remember the steps:

Decide whether you're building a dynamic document or a program with a Web interface.
Choose a computer language.
Choose a program invocation mechanism.
Choose a Web server program to support the first three choices.

You've made the first three choices. Now you have to look around for Web server software that will support them. If you've settled on CGI as a program invocation mechanism, it won't really matter which server you use (server independence being the main point of CGI, after all). If you want to use a server's API, then you need to find a server program that supports the language and development style that you've chosen in steps 1 and 2.

Example 1: Redirect

Our first foray in the Web publishing world was installing the NCSA 1.3 Web server program on our research group's file server, martigny.ai.mit.edu. We didn't bother to make an alias for the machine like "www.brian-and-philip.org" so the URLs we distributed looked like "http://martigny.ai.mit.edu/samantha/".

Sometime in mid-1994 the researchers depending on Martigny, whose load average had soared from 0.2 to 3.5, decided that a 100,000 hit per day Web site was something that might very nicely be hosted elsewhere. It was easy enough to find a neglected HP Unix box, which we called swissnet.ai.mit.edu. And we sort of learned our lesson and did not distribute this new name in the URL but rather aliases: "www-swiss.ai.mit.edu" for research publications of our group (known as "Switzerland" for obscure reasons); "photo.net" for photo stuff; "pgp.ai.mit.edu" for Brian's public key server.

But what were we to do with all the hard-wired links out there to martigny.ai.mit.edu? We left NCSA 1.3 loaded on Martigny but changed the configuration files so that a request for "http://martigny.ai.mit.edu/foo/bar.html" would result in a 302 redirect being returned to the user's browser so that it would instead fetch http://www-swiss.ai.mit.edu/foo/bar.html.

Two years later, in August 1996, someone upgraded Martigny from HP-UX 9 to HP-UX 10. Nobody bothered to install a Web server on the machine. Email began to trickle in: "I searched for you on the Web but your server has been down since last Thursday." Eventually we figured out that the search engines were still sending people to Martigny, a machine that was in no danger of ever responding to a Web request since it no longer ran any program listening to port 80.

Those were the early days of Apache and we couldn't get it to compile. We downloaded an expensive commercial Web server made by a now-defunct company called "Netscape". It was a $5000 product, free to universities, with a built-in redirect facility but sadly there was a bug in the program and the redirects didn't work. Because the product was closed-source we couldn't fix it ourselves. Finally we installed AOLserver, which did not have a neat redirect facility, but its Tcl API seemed flexible enough that it would be possible to make the server do whatever we wanted.

First, we tell AOLserver to feed all requests to a Tcl procedure instead of looking around in the file system:

ns_register_proc GET / martigny_redirect

This is a Tcl function call. The function being called is named ns_register_proc. Any function that begins with "ns_" is part of the NaviServer Tcl API (NaviServer was the name of the program before AOL bought NaviSoft in 1995). ns_register_proc takes three arguments: method, URL, and procname. In this case, the code says that HTTP GETs for the URL "/" (and below) are to be handled by the Tcl procedure martigny_redirect:


proc martigny_redirect {} {
    append url_on_swissnet "http://www-swiss.ai.mit.edu" [ns_conn url]
    ns_returnredirect $url_on_swissnet
}

This is a Tcl procedure definition, which has the form "proc procedure-name arguments body". martigny_redirect takes no arguments. When martigny_redirect is invoked, it first computes the full URL of the corresponding file on Swissnet. The meat of this computation is a call to the API procedure "ns_conn" asking for the URL that was part of the request line.

With the full URL computed, martigny_redirect's second body line calls the API procedure ns_returnredirect. This writes back to the connection a set of 302 redirect headers instructing the browser to rerequest the file, this time from "http://www-swiss.ai.mit.edu".
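
For readers who want to see what actually goes back over the wire, here is a minimal Python sketch of the same redirect. It is an illustration rather than what AOLserver emits byte for byte: the function name is hypothetical and only the status line and Location header are shown.

```python
def redirect_response(request_path, new_site="http://www-swiss.ai.mit.edu"):
    """Build a raw HTTP response sending the browser to the same path on new_site."""
    location = new_site + request_path
    return (
        "HTTP/1.0 302 Found\r\n"
        f"Location: {location}\r\n"
        "\r\n"
    )


response = redirect_response("/foo/bar.html")
print(response)
```

The browser reads the Location header and issues a fresh request to that URL, which is why old links to Martigny keep working.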

Example 2: Customizing Access

MIT Press wanted to sell subscriptions to electronic journals either to institutions or to individuals. They also wanted portions of the journals to be freely available. In the case of an institutional subscriber, the server needed to recognize that the client came from a range of authorized IP addresses, e.g., any computer whose IP address starts with "36." is at Stanford, so if they've paid for a site-wide subscription, don't demand username or password from individuals. For individuals, they decided to start by simply distributing the same username/password pair to all the subscribers. All of the information about who was authorized had to come from their relational database. It turned out that this set of constraints was too complex for the standard permissions module that comes with AOLserver, if only because it uses its own little Unix files-based database. Fortunately, the AOLserver API provides for program invocation prior to page service via a mechanism called filters. After 20 minutes we came up with the following program:

# tell AOLserver to watch for PDF file requests under the /ejournal directory
# if we don't add additional ns_register_filter commands, all the 
# other files will be available to everyone
ns_register_filter preauth GET /ejournal/*.pdf ejournal_check_auth

proc ejournal_check_auth {args why} {
    # all the parameters we might want to change
    set user "open"
    set passwd "sesame"
    # on the real-life server, these are pulled from a relational database
    # but here for an example, let's just set it to MIT and Stanford
    set allowed_ip_ranges [list "18.*" "36.*"]

    foreach pattern $allowed_ip_ranges {
	if { [string match $pattern [ns_conn peeraddr]] } {
	    # a paying customer; the file will be sent
	    return "filter_ok"
	}
    }

    # not coming from a special IP address, let's check the 
    # username and password headers that came with the request
    if { [ns_conn authuser] == $user && [ns_conn authpassword] == $passwd } {
	# they are an authorized user; the file will be sent
	return "filter_ok"
    }

    # not a good IP address, no headers, hammer them with a 401 demand
    ns_set put [ns_conn outputheaders] WWW-Authenticate "Basic realm=\"MIT Press:Restricted\""
    ns_returnfile 401 text/html "[ns_info pageroot]ejournal/please-subscribe.html"

    # stop AOLserver from handling the request by returning a special code
    return "filter_return"
}
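
The decision logic of the filter can be sketched on its own in Python, away from any server API. The function name and the choice to return bare status codes are assumptions made for the illustration; the IP patterns and the username/password pair are the example values from the listing above. The password check decodes a standard HTTP Basic Authorization header.

```python
import base64
from fnmatch import fnmatchcase

ALLOWED_IP_PATTERNS = ["18.*", "36.*"]  # MIT and Stanford, as in the example
USER, PASSWORD = "open", "sesame"


def check_access(peer_address, authorization_header=None):
    """Return 200 if the request may proceed, else 401 to demand credentials."""
    # Institutional subscribers are recognized by IP address alone.
    if any(fnmatchcase(peer_address, pattern) for pattern in ALLOWED_IP_PATTERNS):
        return 200
    # Individuals must present the shared username and password.
    if authorization_header and authorization_header.startswith("Basic "):
        try:
            decoded = base64.b64decode(authorization_header[6:]).decode("utf-8")
        except ValueError:  # covers malformed base64 and undecodable bytes
            return 401
        user, _, password = decoded.partition(":")
        if user == USER and password == PASSWORD:
            return 200
    return 401


token = base64.b64encode(b"open:sesame").decode("ascii")
print(check_access("36.10.0.7"), check_access("128.52.1.1"),
      check_access("128.52.1.1", "Basic " + token))
```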

Example 3: Aid to Evaluating Your Accomplishments (randomizing a page)

"For me grad school is fun just like playing Tetris all night is fun. In the morning you realize that it was sort of enjoyable, but it didn't get you anywhere and it left you very very tired."
-- Michael Booth's comment on the philip.greenspun.com "Women in Computing" page

Computer science graduate students earn a monthly stipend that wouldn't cover the average yuppie's SUV payments and gas bill. If you've been reading Albert Camus lately ("It is a kind of spiritual snobbery to think one can be happy without money") then you'd expect this to lead to occasional depression. For these depressed souls, there is Career Guide for Engineers and Scientists (http://philip.greenspun.com/careers/).

Thought 1: starving graduate students forgoing six years of income would be cheered to read the National Science Foundation report that "Median real earnings remained essentially flat for all major non-academic science and engineering occupations from 1979-1989. This trend was not mirrored among the overall work force where median income for all employed persons with a bachelor's degree or higher rose 27.5 percent from 1979-1989 (to a median salary of $28,000)."

Thought 2: custom photography would help get the message across.

Thought 3: we could really get under the skin of America's best and brightest young computer scientists with Aid to Evaluating Your Accomplishments.

Here's the source code:

# a helper procedure to pick N items randomly from a list
# note that it uses tail-recursion, importing a little bit 
# of the clean Scheme philosophy into the ugly world of Tcl

proc choose_n_random {choices_list n_to_choose chosen_list} {
    if { $n_to_choose == 0 } {
	return $chosen_list
    } else {
	set chosen_index [randomRange [llength $choices_list]]
	set new_chosen_list [lappend chosen_list [lindex $choices_list $chosen_index]]
	set new_n_to_choose [expr $n_to_choose - 1]
	set new_choices_list [lreplace $choices_list $chosen_index $chosen_index]
	return [choose_n_random $new_choices_list $new_n_to_choose $new_chosen_list]
    }
} 

# we encapsulate the printing of an individual person so that 
# one day we can easily change the design of the page (we display
# four people at once and putting this in a procedure keeps us from
# having to edit the same code four times).

proc one_person {person} {
    set name [lindex $person 0]
    set title [lindex $person 1]
    set achievement [lindex $person 2]
    return "<h4>$title $name</h4>\n $achievement <br><br> <center> (<a href=\"http://altavista.digital.com/cgi-bin/query?pg=q&what=web&fmt=&q=[ns_urlencode $name]\">more</a>) </center>\n"
}

# we return HTTP headers to the client

ReturnHeaders

# we return as much of the page as we can before figuring out which four
# people we're going to display; this way if we were going to query a 
# relational database (potentially taking 1/2 second), the user would
# have something on-screen to read

ns_write "<html>
<head>
<title>Aid to Evaluating Your Accomplishments</title>
</head>

<body bgcolor=#ffffff text=#000000>
<h2>Aid to Evaluating Your Accomplishments</h2>

part of <a href=\"/philg/careers.html\">Career Guide for Engineers and Scientists</a>


<hr>

Compare yourself to these four ordinary people who were selected at random:

<br>
<br>
"

# each person is name, title, accomplishment(s)

set einstein [list "A. Einstein" "Patent Office Clerk" \
                   "Formulated Theory of Relativity."]

set mill [list "John Stuart Mill" "English Youth" \
               "Was able to read Greek and Latin at age 3."]

set mozart [list "W. A. Mozart" "Viennese Pauper" \
                 "Composed his first opera, <i>La finta
                 semplice</i>, at the age of 12."]

set jesus [list "Jesus of Nazareth" "Judean Carpenter" \
                "Told young women he was God and they believed him."]

set stevens [list "Wallace Stevens" "Hartford Connecticut Insurance Executive" \
                  "Won Pulitzer Prize for Poetry in 1954; best known for
                   \"Thirteen Ways of Looking at a Blackbird\"."]

# ... there are a bunch more in the real live script

set average_folks [list $einstein $mill $mozart $jesus $stevens]

# we call our choose_n_random procedure, note that we give it an empty
# list to kick off the tail-recursion

set four_average_folks [choose_n_random $average_folks 4 [list]]

ns_write "<table cellpadding=20>
<tr>
<td valign=top>
[one_person [lindex $four_average_folks 0]]
</td>
<td valign=top>
[one_person [lindex $four_average_folks 1]]
</td>
</tr>
<tr>
<td valign=top>
[one_person [lindex $four_average_folks 2]]
</td>
<td valign=top>
[one_person [lindex $four_average_folks 3]]
</td>
</tr>
</table>
"

# note how in the big block of static HTML below, we're forced to 
# put backslashes in front of the string quotes.  This is annoying 
# and we wouldn't have to do it if we'd implemented this using
# AOLserver Dynamic Pages (where the text is HTML by default, 
# Tcl code by exception).

ns_write "

<p>

Programmed by <a href=\"http://www.ugcs.caltech.edu/~eveander/\">Eve
Astrid Andersson</a> and <a href=\"/philg/\">Philip Greenspun</a> in
<a href=\"/wtr/servers.html#naviserver\">AOLserver Tcl</a>.  If you're
a nerd, you might find <a href=\"four-random-people.txt\">the source
code</a> useful.

<P>

Original Inspiration: <cite>How to Make Yourself Miserable</cite>, by
Dan Greenburg

<hr>
<a href=\"/philg/\"><address>philg@mit.edu</address></a>
</body>
</html>
"

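The core of the script above, picking four people at random without repeats and laying them out in a two-by-two table, can be sketched in a few lines of Python. This is an illustrative stand-in, not a port of the real page: it uses the standard library's random.sample in place of the tail-recursive helper, drops the "more" links, and uses plain-text versions of the achievements.

```python
import random
from html import escape

# Each person is (name, title, achievement), following the Tcl lists above.
PEOPLE = [
    ("A. Einstein", "Patent Office Clerk", "Formulated Theory of Relativity."),
    ("John Stuart Mill", "English Youth", "Was able to read Greek and Latin at age 3."),
    ("W. A. Mozart", "Viennese Pauper", "Composed his first opera at the age of 12."),
    ("Jesus of Nazareth", "Judean Carpenter",
     "Told young women he was God and they believed him."),
    ("Wallace Stevens", "Hartford Connecticut Insurance Executive",
     "Won Pulitzer Prize for Poetry."),
]


def one_person(person):
    """Render one person as an HTML fragment."""
    name, title, achievement = person
    return f"<h4>{escape(title)} {escape(name)}</h4>\n{escape(achievement)}\n"


def random_people_table(people, n=4):
    """Choose n distinct people at random and return (chosen, html_table)."""
    chosen = random.sample(people, n)
    rows = []
    for start in range(0, n, 2):  # two cells per table row
        cells = "".join(
            f"<td valign=top>\n{one_person(person)}</td>\n"
            for person in chosen[start:start + 2]
        )
        rows.append(f"<tr>\n{cells}</tr>\n")
    return chosen, "<table cellpadding=20>\n" + "".join(rows) + "</table>\n"


chosen, table = random_people_table(PEOPLE)
print(table)
```
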
Example 4: Focal Length Calculator (taking data from users)

Back in the 1960s, an IBM engineer had a good idea. Build a smart terminal that could download a form from a mainframe computer. The form would have reserved fields for display only, input fields where the user could type, and blinking fields. After the user had filled out all the input fields, the data would be submitted to the mainframe and acknowledged or rejected if there were any mistakes. This method of interaction was rather frustrating and disorienting for users but made efficient use of the mainframe's precious CPU time. This was the "3270" terminal and hundreds of thousands were sold 20 years ago, mostly to big insurance companies and the like.

The forms user interface model fell into the shade after 1984 when the Macintosh "user drives" pull-down menu system was introduced. However, HTML forms as classically conceived work exactly like the good old 3270. Here's an example that is firmly in the 3270 mold, taken from the Lens chapter of my photography tutorial textbook (http://www.photo.net/making-photographs/lens). The basic idea is to help people figure out what size lens they will need to buy or rent in order to make a particular image. They fill in a form with distance to subject and the height of their subject. The server then tells them what focal length lens they need for a 35mm camera.

Here's the HTML source for the form:


<form method=post action=focal-length.tcl>
How far away is your subject?  
<input type=text name=distance_in_feet size=7>  (in feet)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_feet size=7>  (in feet)

<p>

<input type=submit>

</form>

Here's the AOLserver Tcl program that processes the user input:


set_form_variables

# distance_in_feet, subject_size_in_feet are the args from the form
# they are now set in Tcl local variables thanks to the magic 
# utility function call above

# let's do a little IBM mainframe-style error-checking here

if { ![info exists distance_in_feet] || [string compare $distance_in_feet ""] == 0 } {
    ns_return 200 text/plain "Please fill in the \"distance to subject\" field"
    # stop the execution of this script
    return
}

if { ![info exists subject_size_in_feet] || [string compare $subject_size_in_feet ""] == 0 } {
    ns_return 200 text/plain "Please fill in the \"subject size\" field"
    # stop the execution of this script
    return
}

# we presume that subject is to fill a 1.5 inch long-dimension of a
# 35mm negative

# ahhh... the joys of arithmetic in Tcl, a quality language so 
# much cleaner than Lisp

set distance_in_inches [expr $distance_in_feet * 12]
set subject_size_in_inches [expr $subject_size_in_feet * 12]

set magnification [expr 1.5 / $subject_size_in_inches]

set lens_focal_length_inches [expr $distance_in_inches / ((1/$magnification) + 1)]

set lens_focal_length_mm [expr round($lens_focal_length_inches * 25.4)]

# now we return a page to the user, one big string into which we let Tcl
# interpolate some variable values

ns_return 200 text/html "<html>
<head>
<title>You need $lens_focal_length_mm mm </title>
</head>

<body bgcolor=#ffffff text=#000000>
<table>
<tr>
<td>
<a href=\"/images/pcd0952/boston-marathon-46.tcl\"><img HEIGHT=198 WIDTH=132 src=\"/images/pcd0952/boston-marathon-46.1.jpg\" ALT=\"100th Anniversary Boston Marathon (1996).\"></a>
<td>


<h2>$lens_focal_length_mm millimeters</h2>

will do the job on a Nikon or Canon or similar 35mm camera

<P>

(according to the <a href=\"http://www.photo.net/photo/tutorial/lens.html\">photo.net lens tutorial</a> calculator)

</tr>
</table>

<hr>

Here are the raw numbers:

<ul>
<li>distance to your subject:  $distance_in_feet feet ($distance_in_inches inches)
<li>long dimension of your subject:  $subject_size_in_feet feet ($subject_size_in_inches inches)
<li>magnification:  $magnification
<li>lens size required:  $lens_focal_length_inches inches ($lens_focal_length_mm mm)

</ul>

Assumptions: You are using a standard 35mm frame (24x36mm) whose long
dimension is about 1.5 inches.  You are holding the camera in portrait
mode so that your subject is filling the long side of the frame.  You
are supposed to measure subject distance from the optical midpoint of
the lens, which for a normal lens is roughly at the physical midpoint.

<P>

Source of formula:  <a href=\"http://www.photo.net/photo/dead-trees/professional-photoguide.html\">Kodak 
Professional Photoguide</a>
<br>
Source of server-side programming knowledge:  Chapter 9 of 
<a href=\"http://www.photo.net/wtr/dead-trees/\">How to be a Web Whore Just Like Me</a>
<br>
Time required to write this program:  15 minutes. 
<br>
Proof that philg is a nerd:  <a href=\"focal-length.txt\">view the source code</a>
<br>

What this is not: a slow Java program that will crash everyone's browser (except those behind corporate firewalls that block all Java applets)

<br>

Another thing this is not:  a CGI program that will make my poor old Unix box fork

<br>

Yet another thing this is not: a JavaScript program that you'd think
would be the right thing but then on the other hand it wouldn't work with some browsers and the last thing that I need is email from confused users


<h3>Bored?  Try again</h3>

<form method=post action=focal-length.tcl>
How far away is your subject?  
<input type=text name=distance_in_feet size=7 value=\"$distance_in_feet\">  (in feet)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_feet size=7 value=\"$subject_size_in_feet\">  (in feet)

<p>

<input type=submit>

</form>

<h3>European?  Macro-oriented?</h3>

<form method=post action=focal-length-mm.tcl>
How far away is your subject?  
<input type=text name=distance_in_mm size=7>  (in millimeters)
<p>
How high is the object you want to fill the frame?  
<input type=text name=subject_size_in_mm size=7>  (in millimeters)

<p>

<input type=submit>

</form>


<hr>
<a href=\"/philg/\"><address>philg@mit.edu</address></a>
</body>
</html>"
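
The arithmetic at the top of this script is easy to check in isolation. Here is a minimal sketch in Python (not the AOLserver Tcl that actually runs the page). The function name, the constant name, and the sample inputs are made up for illustration; the 1.5-inch figure is the long dimension of the 35mm frame that the page text assumes.

```python
import math

FRAME_LONG_SIDE_INCHES = 1.5  # long side of a 24x36mm frame, as the page assumes


def focal_length_mm(distance_in_feet, subject_size_in_feet):
    """Return the focal length (whole mm) that fills the frame with the subject."""
    distance_in_inches = distance_in_feet * 12
    subject_size_in_inches = subject_size_in_feet * 12
    magnification = FRAME_LONG_SIDE_INCHES / subject_size_in_inches
    focal_length_inches = distance_in_inches / ((1 / magnification) + 1)
    # round half up, which agrees with Tcl's round() for positive values
    return math.floor(focal_length_inches * 25.4 + 0.5)


if __name__ == "__main__":
    # hypothetical input: a 6-foot subject, 20 feet away
    print(focal_length_mm(20, 6))
```

For a 6-foot subject 20 feet away the magnification is 1.5/72, so the focal length is 240/49 inches, or about 124 mm.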

Example 5: Bill Gates Personal Wealth Clock (taking data from foreign servers)

Academic computer scientists are the smartest people in the world. There are an average of 800 applications for every job. And every one of those applicants has a PhD. Anyone who has triumphed over 799 PhDs in a meritocratic selection process can be pretty sure that he or she is a genius. Publishing is the most important thing in academics. Distributing one's brilliant ideas to the adoring masses. The top computer science universities have all been connected by the Internet or ARPAnet since 1970. A researcher at MIT in 1975 could send a technical paper to all of his or her interested colleagues in a matter of minutes. With this kind of heritage, it is natural that the preferred publishing medium of 21st Century computer science academics is . . . dead trees.

Yes, dead trees.

If you aren't in a refereed journal or conference, you aren't going to get tenure. You can't expect to achieve quality without peer review. And peer review isn't just a positive feedback mechanism to enshrine mediocrity. It keeps uninteresting papers from distracting serious thinkers at important conferences. For example, there was this guy in a physics lab in Switzerland, Tim Berners-Lee. And he wrote a paper about distributing hypertext documents over the Internet. Something he called "the Web". Fortunately for the integrity of academia, this paper was rejected from conferences where people were discussing truly serious hypertext systems.

Anyway, with foresight like this, it is only natural that academics like to throw stones at successful unworthies in the commercial arena. The "Why Bill Gates is Richer than You" section on philip.greenspun.com didn't come into its own until the day Brian announced to our little research group at MIT that the U.S. Census Bureau had put up a real-time population clock at http://www.census.gov/cgi-bin/popclock. There had been stock quote servers on the Web almost since Day 1. How hard could it be to write a program that would reach out into the Web and grab the Microsoft stock price and the population, then do the math to come up with what you see at http://philip.greenspun.com/WealthClock?

This program was easy to write because the AOLserver Tcl API contains the ns_httpget procedure. Having a personal server grab a page from the Census Bureau is as easy as

ns_httpget "http://www.census.gov/cgi-bin/popclock"

Tcl the language made life easy because of its built-in regular expression matcher. The Census Bureau and the Security APL stock quote folks did not intend for their pages to be machine-parsable. Yet only a short program was necessary to pull the raw numbers out of a page designed for reading by humans.
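
To make the scraping idea concrete before the real thing, here is a minimal sketch in Python rather than Tcl. The two HTML fragments are made up and only stand in for the remote pages; the function names are hypothetical, and the patterns are simplified assumptions, not the ones the Wealth Clock uses.

```python
import re
from fractions import Fraction

# Made-up HTML fragments standing in for the two remote pages.
SAMPLE_QUOTE_HTML = "<td>Last Traded at</td><td align=right><strong>99 1/8</strong></td>"
SAMPLE_POPULATION_HTML = "<H1>Current estimate: 265,039,007 people</H1>"


def raw_quote_to_decimal(raw_quote):
    """Turn a quote such as "99 1/8" into 99.125."""
    return float(sum(Fraction(part) for part in raw_quote.split()))


def extract_quote(html):
    """Pull the first whole-number-plus-fraction quote out of the page."""
    match = re.search(r"<strong>([0-9 /]+)</strong>", html)
    if match is None:
        raise ValueError("no quote found; the page layout may have changed")
    return raw_quote_to_decimal(match.group(1))


def extract_population(html):
    """Pull a comma-separated number out of the first H1 and drop the commas."""
    match = re.search(r"<H1>[^0-9]*([0-9,]+).*</H1>", html)
    if match is None:
        raise ValueError("no population found; the page layout may have changed")
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(extract_quote(SAMPLE_QUOTE_HTML))
    print(extract_population(SAMPLE_POPULATION_HTML))
```

The weak point of any such program is the same in every language: when the publisher changes the page layout, the pattern stops matching, which is why the sketch raises an error rather than returning a guess.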

Anyway, here is the code. Look at the comments.

# this program copyright 1996, 1997 Philip Greenspun (philg@mit.edu)
# redistribution and reuse permitted under
# the standard GNU license
# this function turns "99 1/8" into "99.125"
proc wealth_RawQuoteToDecimal {raw_quote} {
    if { [regexp {(.*) (.*)} $raw_quote match whole fraction] } {
        # there was a space
        if { [regexp {(.*)/(.*)} $fraction match num denom] } {
            # there was a "/"
            set extra [expr double($num) / $denom]
            return [expr $whole + $extra]
        }
        # we couldn't parse the fraction
        return $whole
    } else {
        # we couldn't find a space, assume integer
        return $raw_quote
    }
}
###
#   done defining helpers, here's the meat of the page
###
# grab the stock quote and stuff it into QUOTE_HTML
set quote_html [ns_httpget "http://qs.secapl.com/cgi-bin/qs?ticks=MSFT"]

# regexp into the returned page to get the raw_quote out
regexp {Last Traded at</a></td><td align=right><strong>([^A-Za-z]*)</strong>} \
       $quote_html match raw_quote

# convert whole number + fraction, e.g., "99 1/8" into decimal,
# e.g., "99.125"
set msft_stock_price [wealth_RawQuoteToDecimal $raw_quote]

# grab the population clock page and stuff it into POPULATION_HTML
set population_html [ns_httpget "http://www.census.gov/cgi-bin/popclock"]

# we have to find the population in the HTML and then split it up
# by taking out the commas
regexp {<H1>[^0-9]*([0-9]+),([0-9]+),([0-9]+).*</H1>} \
       $population_html match millions thousands units

# we have to trim the leading zeros because Tcl has such a
# brain damaged model of numbers and thinks "039" is octal
# this is when you kick yourself for not using Common Lisp
set trimmed_millions [string trimleft $millions 0]
set trimmed_thousands [string trimleft $thousands 0]
set trimmed_units [string trimleft $units 0]

# then we add them back together for computation
set population [expr ($trimmed_millions * 1000000) + \
                     ($trimmed_thousands * 1000) + \
                     $trimmed_units]

# and reassemble them in a string for display
set pretty_population "$millions,$thousands,$units"

# Tcl is NOT Lisp and therefore if the stock price and shares are
# both integers, you get silent overflow (because the result is too
# large to represent in a 32 bit integer) and Bill Gates comes out as a
# pauper (< $1 billion). We hammer the problem by converting to double
# precision floating point right here.
#
# (Were we using Common Lisp, the result of multiplying two big 32-bit
# integers would be a "big num", an integer represented with multiple
# words of memory; Common Lisp programs perform arithmetic correctly.
# The time taken to compute a result may change when you move from a
# 32-bit to a 64-bit computer but the result itself won't change.)
set gates_shares_pre_split [expr double(141159990)]
set gates_shares [expr $gates_shares_pre_split * 2]
set gates_wealth [expr $gates_shares * $msft_stock_price]
set gates_wealth_billions \
    [string trim [format "%10.6f" [expr $gates_wealth / 1.0e9]]]
set personal_share [expr $gates_wealth / $population]
set pretty_date [exec /usr/local/bin/date]

# we're done figuring, now let's return a page to the user
ns_return 200 text/html "<html>
<head>
<title>Bill Gates Personal Wealth Clock</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h2>Bill Gates Personal Wealth Clock</h2>
just a small portion of 
<a href=\"http://www-swiss.ai.mit.edu/philg/humor/bill-gates.html\">Why Bill Gates is Richer than You
</a>
by
<a href=\"http://www-swiss.ai.mit.edu/philg/\">Philip Greenspun</a>
<hr>
<center>
<br>
<br>
<table>
<tr><th colspan=2 align=center>$pretty_date</th></tr>
<tr><td>Microsoft Stock Price:
    <td align=right> \$$msft_stock_price
<tr><td>Bill Gates's Wealth:
    <td align=right> \$$gates_wealth_billions billion
<tr><td>U.S. Population:
    <td align=right> $pretty_population
<tr><td><font size=+1><b>Your Personal Contribution:</b></font>
    <td align=right>  <font size=+1><b>\$$personal_share</b></font>
</table>
<p>
<blockquote>
\"If you want to know what God thinks about money, just look at the
 people He gives it to.\" <br> -- Old Irish Saying
</blockquote>
</center>
<hr>
<a href=\"http://www.photo.net/philg/\"><address>philg@mit.edu</address>
</a>
</body>
</html>
"

So is this the real code that sits behind http://philip.greenspun.com/WealthClock?

Actually, no.

Why the differences? I was concerned that, if it became popular, the Wealth Clock might impose an unreasonable load on the subsidiary sites. It seemed like bad netiquette for me to write a program that would hammer the Census Bureau and Security APL several times a second for the same data. It also seemed to me that users shouldn't have to wait for the two subsidiary pages to be fetched if they didn't need up-to-the-minute data.

Ten lines of Tcl suffice to create a general purpose caching facility that can cache the results of any Tcl function call as a Tcl global variable. This means that the result is stored in the AOLserver's virtual memory space and can be accessed much faster even than a static file. Users who want a real-time answer can demand one with an extra mouse click. The calculation performed for them then updates the cache for casual users.
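
The production Tcl is not reproduced here, but the idea fits in a few lines of any language. What follows is a minimal sketch in Python under stated assumptions: the name memoize, the max_age_seconds, force, and clock parameters, and the one-hour default are all hypothetical, and a real multi-threaded server would also need a lock around the shared dictionary.

```python
import time

_cache = {}  # maps (function, args) -> (time stored, value)


def memoize(func, *args, max_age_seconds=3600, force=False, clock=time.time):
    """Return func(*args), reusing a cached value younger than max_age_seconds.

    force=True always recomputes and refreshes the cache, which is what the
    "give me a real-time answer" click would do.
    """
    key = (func, args)
    now = clock()
    if not force and key in _cache:
        stored_at, value = _cache[key]
        if now - stored_at < max_age_seconds:
            return value
    value = func(*args)
    _cache[key] = (now, value)
    return value
```

A page would then call something like memoize(fetch_page, url) for casual users and memoize(fetch_page, url, force=True) for the impatient ones, where fetch_page is whatever function does the slow remote fetch. Either way the remote sites see at most one request per cache lifetime from casual traffic.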

Does this sound like overengineering? It didn't seem that way when Netscape, then makers of the world's most popular Web browser, put the Wealth Clock on their What's New page for two weeks (summer 1996). The URL was getting two hits per second. Per second. And all of those users got an instant response. The extra load on the Web server was not noticeable. Meanwhile, all the other sites on Netscape's list were unusably slow. Popularity had killed them.

Here are the lessons from this example:

Powerful APIs lead to innovative Web sites; I would probably have never gotten around to writing the Wealth Clock if it hadn't been for the ns_httpget call.
Hard-core performance engineering pays off; Web sites can catch on fast (and fade fast too).
You want to get your site linked from one of the default pages for a popular browser.

Epilogue: Our friend Brian, who started the whole Wealth Clock craze with his Census Bureau site discovery, now works at Microsoft, along with all the other smart computer science PhDs that we know who aren't at universities.

Example 6: AOLserver Dynamic Pages

As long as we're on the subject of Bill Gates, it is worth demonstrating the syntax and style that his company inspired with its Active Server Pages. The folks at America Online fell in love with this idea but not with the idea of forever wedding their Web services to Windows and IIS. Thus they added a similar facility to AOLserver called AOLserver Dynamic Pages (ADP), which underlie the original WimpyPoint system, described in Chapter 1.

The idea is that someone will come to the site, look for the name of the author, then click down to find the presentation of interest.

Here's the ADP source code:

<% wimpy_header "Choose Author" %>

<h2>Choose an Author</h2>

in <a href="/"><%=[wimpy_system_name]%></a>

<hr>

Here's a list of users who have public presentations:

<ul>

<%

set db [ns_db gethandle]
set selection [ns_db select $db "select distinct u.user_id, u.last_name, u.first_names,  u.email
from wimpy_users u, wimpy_presentation_ownership wpo, wimpy_presentations wp
where u.user_id = wpo.user_id
and wpo.presentation_id = wp.presentation_id
and wp.public_p = 't'
order by upper(u.last_name), upper(u.first_names)"]

while { [ns_db getrow $db $selection] } {
    set_variables_after_query
    ns_puts "<li><a href=\"user-top.adp?user_id=$user_id\">$last_name, $first_names ($email)</a>\n"
}

%>

</ul>

Or you can do a full-text search through all the slides:

<form method=GET action="search.adp"> 
Query String:  <input type=text name=query_string size=50>
<input type=submit value="Submit">
</form>

<% wimpy_footer %>

Note that one is allowed to use arbitrary HTML, including string quotes, at the top level of the file. Note further that there are two escapes to the ADP evaluator. The basic escape is <%, which will execute a bunch of Tcl code for effect. If the Tcl code wants to write some bytes to the browser, it has to call ns_puts. The second escape sequence is <%=, which will execute a bunch of Tcl code and then write the result out to the browser. Generally one uses the <%= style for simple things, e.g., including the system name that is returned from the Tcl procedure wimpy_system_name. One uses the <% style to execute a sequence of Tcl procedures to query the database, etc.
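
To show how little machinery the two escapes require, here is a toy evaluator, sketched in Python with Python standing in for Tcl as the embedded language. It is an illustration of the idea, not how AOLserver implements ADP; render and puts are hypothetical names (puts plays the role of ns_puts), and the sketch only handles code that fits in a single statement or expression.

```python
import re

_ESCAPE = re.compile(r"<%(=?)(.*?)%>", re.DOTALL)


def render(template, env=None):
    """Evaluate a tiny ADP-like template, using Python in place of Tcl."""
    env = dict(env or {})
    out = []
    env["puts"] = out.append  # plays the role of ns_puts
    pos = 0
    for match in _ESCAPE.finditer(template):
        out.append(template[pos:match.start()])  # literal HTML passes through
        code = match.group(2).strip()
        if match.group(1) == "=":
            out.append(str(eval(code, env)))  # <%= ... %>: write the value
        else:
            exec(code, env)  # <% ... %>: run for effect only
        pos = match.end()
    out.append(template[pos:])
    return "".join(out)


if __name__ == "__main__":
    page = ('in <a href="/"><%= system_name() %></a>'
            '<ul><% for name in authors: puts("<li>" + name) %></ul>')
    print(render(page, {"system_name": lambda: "WimpyPoint",
                        "authors": ["Author A", "Author B"]}))
```

Because the template text is handed to eval and exec, a facility like this is only safe for pages written by the site's own programmers, never for text submitted by users.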

Example 7: Active Server Pages

I haven't personally written any Microsoft Active Server Pages. Fortunately, Microsoft set up Windows/IIS/ASP back in the mid-1990s such that if you were curious to see the source code behind http://foobar.com/yow.asp, you had only to type "http://foobar.com/yow.asp." (note the trailing period) into your browser and the foreign server would deliver the source code right to your desktop. This was a great convenience for people trying to learn ASP; however, it presented something of a security problem for Web publishers, because they would often have their database or system administration passwords in the source code. It seems that Microsoft's intention was not to make public all of its customers' source code and hence they eventually released a security patch to change this behavior. However, a few months later people learned that requesting "http://foobar.com/yow.asp::$DATA" (note the trailing "::$DATA") would also get them the source code.

A nice collection of ASP examples at http://philip.greenspun.com/books/panda/aspharvest/ was harvested in just a couple of hours of surfing one night in July 1998. It is a bit interesting that this surfing was done some time after the bug had become common knowledge yet companies such as DIGITAL, Arthur Andersen, and banks had not patched their servers. What is even more interesting is that by July 2003 nearly all of those companies had gone bankrupt or been absorbed.

firewall.asp is amusing because it is DIGITAL's advertisement for their network security products. Similarly GAP Instrument Corp. took the trouble to warn users

You have reached a computer system providing United States government information. Unauthorized access is prohibited by Public Law 99-474, (The Computer Fraud and Abuse Act of 1986) and can result in administrative, disciplinary or criminal proceedings.

yet had left their ASP pages wide open.

CompuServe gives us a nice simple example with Conf.asp. The goal of the script is to first figure out whether the person browsing is a CompuServe member or not and then serve one of two entirely separate HTML pages. An if statement is thus opened inside one <% %> and closed in another:

<!--#INCLUDE VIRTUAL="/Forums/member.inc"-->
<% if member = 1 then %>
<HTML>
<HEAD>
<TITLE>TW Crime Forum</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>

... ** a page for members *** ..

</BODY>
</HTML>

<BR><I>We Update the Forum Directory Weekly.  The directory was last updated: Thursday, January 08, 1998</I>
...
</BODY>
</HTML>

<% else %>
<HTML>
<HEAD>
<TITLE>TW Crime Forum</TITLE>
</HEAD>
<BODY BGCOLOR=#FFFFFF>

... ** a page for non-members ***

</BODY>
</HTML>
<%End If%>

An interesting thing to note about this page is that CompuServe hasn't run their HTML through a syntax checker, which would no doubt have complained about the stuff after the first </HTML> in the members' branch: the "We Update the Forum Directory Weekly" line and a second </BODY></HTML> pair.

Let's move on to some db-backed pages.

The folks who built Fulton Bank's site are very enthusiastic about Microsoft:

"The hottest technology to hit the Internet which is actually useable now is Active Server Page scripting. This has given us a number of advantages over the ancient art of CGI. ... Intranets and Extranets where the variety of user machine platforms, processors, etc are an issue ASP can play in nicely."
-- Xspot.com (once apparently a thriving Web development concern, now apparently bankrupt)

Let's see how ASP works for them in process_product.asp, a script that takes a query string and tries to find banking products that match this query string.

<% affcode = 1057 %>

<HTML>
<HEAD>
<TITLE>Fulton Bank</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">

<BLOCKQUOTE>
<TABLE WIDTH=370 ALIGN="middle">
<TR>
<TD>
<BR>
<IMG SRC="images/header_products.gif"><BR>
<BR>
<BR>

<% 
   Set Conn=Server.CreateObject("ADODB.Connection")
   Conn.Open "FultonAffiliates"

   SQL = "SELECT * " & _
         "FROM products " & _
         "WHERE productname " & _
         "LIKE '%" & Request.Form("product") & "%' " & _
         "AND affiliate = '" & affcode & "'"

   Set RS = Conn.Execute(SQL)
%>

<TABLE>

<% if RS.EOF then %>
<TR><TD>Sorry No Products Found</TD></TR>
<% end if %>

<% DO UNTIL RS.EOF %>
<TR>
<TD VALIGN="top"><IMG SRC="images/diamond3.gif"></TD>
<TD>
<A HREF="<% = RS("url") %>"><FONT COLOR="blue"><% = RS("productname") %></FONT></A><BR>
<% = RS("shortdesc") %><BR>
<BR> <BR>
</TD>
</TR>
<% RS.MoveNext %>
<% LOOP %>
</TABLE></BLOCKQUOTE>
</TD>
</TR>
</TABLE>
<% rs.close
   conn.close
%>
<!--#include file="footer.asp"-->
</BODY>
</HTML>

This is some pretty clean code. The programmers have encapsulated the database password in their ODBC connection configuration. Also, rather than just bury the magic number "1057" in the code, they set affcode to it as the very first line of the program. Finally, they've parked the page footer in a centralized footer.asp file that gets included by all of their scripts.
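
One thing the script does not do is keep user input separate from the SQL: Request.Form("product") is concatenated straight into the query text, so a carefully crafted search string could change the meaning of the query. The usual remedy is a parameterized query. Here is a minimal sketch using Python's sqlite3 module rather than ASP and ADO; the function names and the rows are made up, and only the table name, the column names, and the affiliate code 1057 come from the script above.

```python
import sqlite3


def find_products(conn, search_text, affcode):
    """Return (productname, url, shortdesc) rows whose name contains search_text."""
    sql = (
        "SELECT productname, url, shortdesc FROM products "
        "WHERE productname LIKE ? AND affiliate = ?"
    )
    # the driver passes the values separately, so quotes in search_text stay data
    return conn.execute(sql, ("%" + search_text + "%", affcode)).fetchall()


def demo_connection():
    """Build a throwaway in-memory database holding made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products "
        "(productname TEXT, url TEXT, shortdesc TEXT, affiliate TEXT)"
    )
    conn.executemany("INSERT INTO products VALUES (?, ?, ?, ?)", [
        ("Basic Checking", "/checking.html", "A made-up checking account", "1057"),
        ("Home Equity Loan", "/equity.html", "A made-up loan", "1057"),
        ("Other Bank Checking", "/other.html", "Another affiliate's product", "2000"),
    ])
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print(find_products(conn, "Checking", "1057"))
    # a classic injection attempt is just an odd search string that matches nothing
    print(find_products(conn, "' OR '1'='1", "1057"))
```

With string concatenation the second search would have rewritten the WHERE clause; with placeholders it is compared against product names like any other text.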

Summary

Server-side programming is straightforward and can be done in almost any computer language. However, making the wrong technology decisions can result in a site that requires ten times the computer hardware to support. Bad programming can also result in a site that becomes unusable as soon as you've gotten precious publicity. Finally, the most expensive asset you are developing on your Web server is content. It is worth thinking about whether your server-side programming language helps you get the most out of your investment in content.

Structure and Interpretation of Computer Programs (Abelson and Sussman 1996; MIT Press) is the book that we use at MIT to teach people how to program. Even if you're already an experienced programmer, the book can be inspiring and useful for the vocabulary it introduces.

philg@mit.edu

Reader's Comments

Note that the final example has a major security flaw: it incorporates strings from the user's request directly into the text of a SQL query. This is subject to "SQL injection": carefully crafted SQL could alter the semantics of the query to return more information than intended by the site authors. Real DB applications will use parameterized SQL these days.

-- Lee Schumacher, March 9, 2005
