Better Living Through Chemistry

Chapter 16: Better Living Through Chemistry

by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing

Revised July 2003

Part of an old gas pump. King's Canyon National Park, California.

This chapter explores two Internet applications. The first tells you whether your neighborhood is being polluted and, if so, by whom. The second helps groups of citizens communicate with politicians and other decision-makers. We will examine the technical challenges presented by these applications, but first let's look at how the service ideas themselves fit into the modern conception of the Internet.

Two Ways of Looking at a Blackbird

Halibut caught by tourists in Homer, Alaska.

If we divide Internet applications into for-profit and not-for-profit a pattern emerges: the for-profit applications can be remarkably useful; the not-for-profit efforts look like things that a 12-year-old could have done in Microsoft FrontPage.

In some ways Internet should be an ideal area for non-profit organizations. Computers are good amplifiers of the human imagination and the Internet is a fantastically good amplifier: one good idea and a little bit of programming has the potential to attract millions of people to a site.

Online communities in particular ought to be especially powerful for non-profit orgs. For-profit corporations can afford to hire full-time employees and rent an office large enough to be a shared workspace for those employees, Monday through Friday, 9-5, every week of the year. A non-profit org, by contrast, may depend substantially on volunteer labor. Most people are unwilling or unable to volunteer 40 hours per week. Perhaps Joe Potential-Volunteer is able to contribute only an hour or two per week and lives on the other side of the country from DoGooders.org. Joe isn't going to be very useful in the physically-centralized work of DoGooders but he could be an equal participant in any work that was done in or coordinated through an online community.

Sadly, however, as of 2003 the average foundation prefers not to fund projects involving information technology. Non-profit orgs spend billions of dollars every year on IT but nearly all of it goes to purchasing and maintaining Microsoft Windows machines, Microsoft Office, and Microsoft Exchange Server rather than innovative Web-based applications that could amplify their reach and expand their volunteer labor force.

Environmental Defense (www.environmentaldefense.org) has been an exception to the usual non-profit rule and this chapter discusses two applications that they conceived and constructed.

Government Regulation Protecting You

In the bad old days there was no government regulation of pollution. If you were a farmer and the paper mill upstream was poisoning the river, you couldn't complain that the mill was violating statutes and administrative rules. Your only remedy was to sue the mill under Common Law (legal precedent going back to Germanic tribal law).

How effective could that be?

It turns out that under Common Law, people have an absolute right to clean air and water. If the paper mill was polluting the river, they had to pay for all the damage they did to downstream users of the water. If a plaintiff proved damage and causality (the difficulty of which is chronicled in A Civil Action), a court would order the mill to compensate the farmer for killed livestock, vet bills, and less plentiful crops.

Under the current legal system, the government regulates pollution by promulgating state and federal laws and making administrative decisions. If your cows are dying and your fields won't support crops, you can still sue the paper mill. However, they don't have to pay you anything if they can show that their output of filth is not in excess of the statutory limit.

Workman's compensation laws grew out of the same kind of situation. When a factory worker was killed because of the company's negligence, the relatives of the deceased would sue the owners for megabucks. The owners didn't want to invest in making the factories safe and they didn't want to pay big jury awards. So they got politicians to pass workman's compensation laws limiting their liability to $X for the loss of an arm or $Y for a worker who was killed. That's why you find injured workers suing the companies that make the machines on factory floors rather than the factories themselves.

If a factory isn't legally required to reduce pollution beyond the limits established by statute, how then to nudge them into reducing discharges?

Is Information Power?

Stockholm, Sweden, viewed from Millesgarden

Although the Environmental Defense is probably best-known for convincing McDonald's to eliminate its wasteful packaging, the 200-person organization has been working for nearly 40 years to increase the amount of information available about pollution, environmental hazards, and waste generation.

Is distributing information sufficient to cut pollution? Remarkably, it seems to be.

In 1988 the Environmental Protection Agency began collecting and publishing data on chemicals released from manufacturing facilities. If you check the nationwide report (http://www.scorecard.org/env-releases/us.tcl), you'll see a dramatic reduction in releases of chemicals covered by the federal Toxic Release Inventory (TRI).

What could a consumer learn from the TRI data? That Visteon in Connersville, Indiana released 605,000 pounds of trichloroethylene to the air in 1995 (see http://www.scorecard.org/env-releases/facility.tcl?tri_id=47331FRDLCSTATE). Is that good or bad? Is trichloroethylene hazardous? Carcinogenic?

In California, voters passed Proposition 65 in 1986. This requires industry not only to disclose but also warn the public about exposure to known carcinogens and reproductive toxicants. Thus a resident of California would be warned that a local factory was releasing trichloroethylene and that it was a recognized carcinogen and reproductive toxicant.

Note that trichloroethylene is the villain, along with W.R. Grace and Beatrice Foods, in the gripping book A Civil Action, which covers the litigation by families in Woburn, Massachusetts whose children had died of leukemia.

The result? California companies cut emissions of TRI chemicals roughly in half, just as manufacturers did elsewhere in the nation. But California companies cut emissions of Prop 65 chemicals to one quarter of their previous usage. Apparently disclosure plus interpretation is more powerful than disclosure alone.

Would 750 MB of data help you out?

Thanks to the Internet it turns out to be pretty easy to pull together 750 MB of data on toxic releases and the health effects of various chemicals. This is potentially useful and interesting. How do we make it actually useful and interesting?

Interpret -- show users how a chemical is used, by which industries, and how hazardous it is.
Localize -- let users type in their zip code and see what's being released in their neighborhood.
Rank -- show users how their county compares to others in the state or nation; show users which factories put out the most toxins.
Slice -- let users see which chemicals are harmful to developing children.
Facilitate -- help users contact the companies operating these facilities by letting them send faxes to the top-ranking polluters.

Our attempt to do all of these things is http://www.scorecard.org. Did we succeed? Bill Pease, the U.C. Berkeley professor who dreamed up the idea, would probably point to the 40 users a second who showed up on the very first day (April 15, 1998) or Dupont's "The Goal is Zero" advertising campaign. Alternatively one could look to user feedback:

"What makes you think that normal nonbrainwashed average americans care about your alarmist, victim-minded, advocacy-pseudoscientific, anti-profit, anti-success, anti-business, anti-American, emotion over logic, big brother, landgrabbing, legal enemy of landowners large and small, mankind hating, pantheistic, gaia worshippping, tree-hugger drivel?"
-- email received by the author July 23, 1998

If a site doesn't make people angry enough to sue then it probably isn't saying much.

Personalization

Charter fishing captain cleaning a tourist's catch. On the wharf in Menemsha, Martha's Vineyard, Massachusetts

We decided that Scorecard would be a showcase for personalization. Users would tell us

where they lived
whether they were pregnant
whether they preferred a text-only or graphics site
whether they wanted to see chemicals displayed as upper or lowercase
whether they wanted longer lists to display as HTML tables (slow) or bullet lists (fast)

Nearly every page on the site would be able to use these personal data to increased the relevancy of the information. Our original idea was to have the "personalize" link target a separate browser window (<a target=personalize href=personalize.tcl>). This way a user would be able to play with personalization settings and reload a Scorecard page to see the effects.

What percentage of Scorecard users took advantage of this fancy system? To a first approximation, 0 percent. Though we only asked for e-mail address and name, most people apparently didn't understand how the "personalize" form would help them. Only about 1000 people per month were personalizing. So we augmented our database-backed system with simpler one-click "text-only" or "graphics site" links that set a persistent cookie on the user's browser:

Set-Cookie: scorecard_graphics_default=f; path=/; expires=Fri, 01-Jan-2010 01:00:00 GMT

Sending faxes

One of the key elements of Scorecard is that users can fill out a form on our site and then send a fax to the factory down the street. This turned out to require only about one hour of programming. There are lots of companies that will take a carefully formatted e-mail message and send out a fax on your behalf. We used http://www.interpage.net.

set subject "fax -extword codes"
set body "<extended begin codes>
-select fax -code **accnt-number** -font courier -fontsize 10 
-nocover -confsucc **admin-email-address** -conffail **admin-email-address**
$fax_digits_only
<extended end codes>

<<begin fax>>
$final_letter
<<end fax>>
"

ns_sendmail "**clientaddr**@interpage.net" "postmaster@scorecard.org" $subject $body

Note that the actual sending of the email is left to the AOLserver API procedure ns_sendmail.

Making it fast

100th Anniversary Boston Marathon (1996).

Thanks to Environmental Defense's public relations savvy and the public's interest in toxic exposure data, we found ourselves in the business of hosting one of the Internet's most popular database-backed Web sites on a 70 MHz computer whose performance would not, even at the time, have satisfied a video game-addicted 10-year-old. Yet even when handling 20 requests a second the page loading times from scorecard.org were faster than those from many static Web sites.

How did we accomplish this? By not doing anything clever, modern, or advanced. We kept the data in Oracle 8 and gave Oracle enough RAM to cache most of the data set. We used AOLserver, an extremely simple Web server program with a built-in script interpreter. We used a scripting language rather than layers of Java classes.

The Bottom Line

Scorecard has made lots of people angry. It has made lots of people think. It has helped schoolkids and university students learn about toxics and environmentalism. It has made lots of companies pay attention to their neighbors. Partly due to pressure engendered by Scorecard's appearance on national TV news, the chemical industry accelerated a testing program for toxicity from 20 years to 5 years.

If Scorecard is something that looks hard but was easy, the Action Network (actionnetwork.org) is something that sounds easy but was in fact way hard.

Influencing Decision Makers

Christopher Alexander argued in his 1977 classic A Pattern Language for countries of no more than a few million people:

"It is not hard to see why the government of a region becomes less and less manageable with size. In a population of N persons, there are of the order of N^2 person-to-person links needed to keep channels of communication open. Naturally, when N goes beyond a certain limit, the channels of communication needed for democracy and justice and information are simply too clogged, and too complex; bureaucracy overwhelms human process. ...
"We believe the limits are reached when the population of a region reaches some 2 to 10 million. Beyond this size, people become remote from the large-scale processes of government. Our estimate may seem extraordinary in the light of modern history: the nation-states have grown mightily and their governments hold power over tens of millions, sometimes hundreds of millions, of people. But these huge powers cannot claim to have a natural size. They cannot claim to have struck the balance between the needs of towns and communities, and the needs of the world community as a whole. Indeed, their tendency has been to override local needs and repress local culture, and at the same time aggrandize themselves to the point where they are out of reach, their power barely conceivable to the average citizen."

The United States is 100 times larger than Alexander thought prudent and indeed most people feel like inhabitants rather than citizens participating in decision-making.

For an illustration of the limits of government power in the Microsoft Age, see the July 13, 1998 issue of Government Computer News (http://www.gcn.com/archives/gcn/1998/july13/cov2.htm) which describes the Navy's $1 billion Aegis missile cruiser Yorktown being towed into port after some problems with its Windows NT cluster.

If you have a huge bank account, you don't need a Web server to influence politicians. You can go to Capitol Hill and hand out checks. However, if you forgot to get rich during the Great Internet Boom then you might have to explore other options.

At least in theory, a politician is responsive to constituents. If a large number of constituents call, write, or fax the politician then the politician's vote might be influenced. An advocacy group will traditionally contact its members by telephone or mail asking them to contact their congressmen and senators. Since only a few percent of people will respond to this appeal, it is extremely expensive to get one constituent to contact one politician. Furthermore, many of those contacts will occur after the vote has occurred; many of the most destructive laws are passed within just a few days.

In order to really make this effective, we need

a cheap way of alerting group members to the impending decision, before it is too late for the members to voice their opinion.
a way of saving group members the trouble of figuring out who their politicians are and how to contact them.
a way of telling professional lobbyists, before they walk into a politician's office, that "15 faxes have been sent by group members to this person so far".
a way of informing members who responded that "your response was communicated to Jane Senator and she voted the right way" or "your response was communicated to Fred Congressman and he voted the wrong way".
a way of reminding Jane and Fred, after they've voted, that their performance was conveyed to X thousand members of the group.

Suppose that one advocacy group was able to do this for themselves? They would necessarily have to run some kind of computing system plus go to the effort of maintaining politician contact information. Why not then add one more system requirement:

a way of letting other advocacy groups use the same system

Our Solution

We have a cheap way of contacting members: email. It also happens to be fast enough that members will be alerted well before decisions are made. We can send email notes to members asking them to come to our Web site and draft a note to their representatives. How do we know which these are? We keep a database of politicans, their states and district numbers, and their contact information. For each activist, we ask for a street address and five-digit zip code. Using software lifted from the Bill Gates Personal Wealth Clock, we can use these data to query a variety of other Web sites and get the member's ZIP+4 and congressional district. To deliver the communications, we use the same Interpage email-to-fax service that we used in Scorecard.

The rest is mere programming! How do we make this technology available to like-minded advocacy groups? Add a few extra tables to store information on authorized groups and to map users to those groups.

What if I Don't Want To Talk to My Politicians?

If you've ever watched an American politican on television or read John Jackley's brilliant book Hill Rat, you might not want to communicate with politicians. Is this software then useless?

No! We actually built it so that it could also be used any time a large number of people need to influence a small number of people. For example, a large company might decide to use the software to assist in collaborative decision making. Joe Employee would contribute his opinion to the system and that opinion would get shown to Joe's boss, Joe's boss's boss, and on up the line. However, Joe's opinion would not be shown to managers in sister divisions. Thus managers would get a feel for what their own employees had to say. In this intranet example, the executives being lobbied are presumably cooperating in the process. There would thus be no need to fax opinions to decision makers; email or Web delivery of summaries would be preferred.

How Does it Work? (short)

We keep the following in a database:

a table of all the organizations using the system: an_domains (note the an_ prefix; it lets these tables coexist in a database that is also used for other applications). For each organization, we need to know whether or not they are US-based (in which case things like the state or ZIP+4 matters), how much information they wish to collect from new members, and how much we're going to charge them to send faxes and email messages on their behalf (we'll have to recycle the software from the ecommerce chapter).
a table of all the activists: an_members. For each member, we store contact information, whether or not email has started bouncing from his address, what congressional, state senate, and state house districts he is in, and his history of responding to alerts.
a table recording to which advocacy groups particular members belong, an_member_domain_map. Note that this table can also contain a "member ID" key into a group's private database on their own computer system (on the assumption that most groups using the system will have their own systems for keeping track of members).
a table recording which members are authorized to administer a domain: an_maintainer_domain_map. If there is a row present in this table, a person will be permitted to send out alerts, view statistics from old alerts, etc.
a table of decision makers: an_targets. This contains contact information, including the fax number, but more importantly the state and a filter_value column that can be used to match a target with members from, for example, only one congressional district.
two tables to group decision makers: an_target_groups and an_target_group_map.
a table of all the alerts that have been sent out: an_alerts.
a table specifying that alerts are to be sent to particular members: an_alert_member_map
a table specifying that alert responses are to be sent to particular targets: an_alert_target_map. A row in this table can say, "Send all responses from members in any state to the EPA District Administrator," or "Send responses from members to the group of U.S. Senators, matching by state," or "Send response from members to the group of U.S. Congressmen, matching by state and then congressional district."
tables recording mass uploads of members with enough information to remove some or all of those members if, for example, a domain administrator uploaded 10,000 records with the first and last names reversed.

Then, uh, we write some Web server scripts.

How Does it Work? (long)

Not your average covered bridge photo. New Hampshire.

The Action Network gives us an opportunity to talk about a bunch of interesting ideas:

database triggers
PL/SQL
Oops-proof Batch Upload
high-efficiency/high-security user authentication

Database Triggers

One of the things that we record about an actionnetwork.org member is his or her original registration date:

create table an_members (
	member_id	integer not null primary key,
	email		varchar(100) not null unique,
	password	varchar(50) not null,
	registration_date	date,
        ...
);

There may be lots of application programs, in our case typically Web page scripts, that add members. If we want to make sure that registration_date is always set, we could either inspect every application program and write documentation instructing future programmers that this column must always be set to the current date. Or we could add an Oracle trigger:

create trigger an_member_registration_date
before insert on an_members
for each row
when (new.registration_date is null)
begin
 :new.registration_date := SYSDATE;
end;

It is now Oracle's job to make sure that member insertions that don't have a registration date will get one. Whom would you rather trust? Oracle or the pimply faced kid that is likely to take over the system that you developed?

PL/SQL

Suppose that a computation is needed in a variety of places on your site. You want it to be consistent everywhere, i.e., that the answer for inputs X, Y, and Z is the same on every Web page that performs the computation. Novice programmers think this is easy. They will just put the same mathematical expression in all the Web pages. Expert programmers know that one day someone will change seven out of eight of those pages, leading to subtle inconsistencies because of that one unmodified page.

Depending on your Web server, there are various ways to encapsulate this computation so that you only need to change the formula in one place. If you were using Microsoft .NET you'd use a public accessible class. If you were using AOLserver, you could define a Tcl procedure that gets loaded at server start time. However, what if a variety of applications all talk to the same Oracle database? Perhaps your public Web site is AOLserver but someone else has built a maintenance application in Microsoft IIS/ASP.NET and still other folks use Crystal Reports and Microsoft Access to query the tables?

Oracle lets you write little programs and run them inside the database server. Thus any application that is querying the Action Network tables can also invoke these little programs, which are often referred to as stored procedures. If you change your mind about the formula, you only have to change it in one place. With the most modern databases, you can write these programs in a variety of standard computer languages such as C# or Java. With older versions of Oracle, however, you have to use Oracle's proprietary PL/SQL language:

create or replace function probability_of_response (
n_times_notified IN integer,
n_times_responded IN integer)
return number
as
begin
  if n_times_notified < 5 then
    -- not enough experience, just use a default
    return 0.1;
  else
    return n_times_responded/n_times_notified;
  end if;
end probability_of_response;

This function takes two inputs, both integers, that say how many times a member has been alerted and how many times that person has responded. The return number tells Oracle that this function must always return a number, in this case the estimated probability of a member responding to a new alert. If the activist hasn't been notified at least five times, we assume that we don't have enough data for statistical significance. Thus the function returns a default value of 0.1. Otherwise, it does the division. Note that the presence of n_times_notified < 5 prevents this function from generating a divide-by-zero error.

For more on PL/SQL read Oracle9i Pl/Sql Programming (Urman 2001) and the the O'Reilly books, of which Feuerstein's Oracle Pl/Sql Programming is the best known. Keep in mind that though it might be more fashionable to build stored procedures in C# or Java it is often clearer and more concise to write in a database-oriented language such as PL/SQL. The real benefit to using a more conventional programming language would be to take advantage of library code that was available in that language.

Oops-proof Batch Upload

It is often useful to allow users to upload huge tables of information, formatted as comma-separated values (CSV) files. The CSV format is convenient for users because it is native to popular desktop spreadsheet applications such as Microsoft Excel and yet is not a proprietary format that can be changed at the whim of a vendor.

Here are a few applications for user-uploaded structured data:

The administrator of an advocacy domain in Action Network ought to be able to upload a database of several thousand members.
A participant in a photo sharing community ought to be able to upload a table of captions of the photos that he has taken (thus enabling a student or art director to say "show me people who have photos of the Eiffel Tower").
After you are discharged, a hospital ought to be able to upload your visit history to your personal medical record server, thus making these data available ten years later if you go for treatment to a doctor in a foreign city.

There are a couple of strategies for mapping these data into your tables. I'll call the first strategy "AOLserver-style" because it is what AOLserver gives you out-of-the-box when you request a URL of the form "foobar.csv". What AOLserver does is show you the first 20 rows of the CSV file and, at the top of each row, lets you pick a database table column into which you'd like the input data column to be stuffed:

The second strategy is to be lazy/mean. The Web site publisher specifies a format and the user is expected to manipulate his data in a desktop app until the data fit the specification. For the Action Network system, the lazy/mean approach would be to say

Here is a list of columns for each row (* fields are required to be non-empty):

foreign_key a unique key in your database system, e.g., "member id"
source where you got this member, e.g., "bread_and_puppet_booth"
member_type e.g., "inner_circle"
email * must be a full Internet email address, e.g., "foo@bar.com"
first_names * e.g., "Lisa Marie"
last_name * e.g., "Presley"
...

`foreign_key`	a unique key in your database system, e.g., "member id"
`source`	where you got this member, e.g., "bread_and_puppet_booth"
`member_type`	e.g., "inner_circle"
`email *`	must be a full Internet email address, e.g., "foo@bar.com"
`first_names *`	e.g., "Lisa Marie"
`last_name *`	e.g., "Presley"
...

and then encourage the user to do his or her best to get the data into this format.

In the AOLserver-style case, you have to anticipate the domain administrator who chooses "last name" for the column containing first names. In the lazy/mean case, you have to anticipate the domain administrator who produces a CSV file in which the columns are not in the expected order or in which a column is left out by mistake.

In both cases, it is possible that 10,000 bogus records will be inserted into the database.

Oops.

Shouldn't we be able to solve this problem with SQL transactions? We'll just tell the RDBMS to open a transaction before inserting the first record and won't tell the RDBMS to commit the transaction until after the 10,000th record has gone in and been inspected. An insurmountable problem with this idea is that the person uploading the data might not realize for a few days that something was wrong, particular if the error was swapping two little-used columns.

An obvious fix: "As the program is inserting a new member, stuff some crud into the database recording the insertion as part of Upload #3271." What's unattractive about this idea is that it seems that one would need to write slightly different software for the photo captioning upload, the Action Network upload, and the medical record upload. Couldn't one write a general piece of software that would work for all three applications?

Back in the late 1990s, it seemed to me that the problem was very similar to the transaction processing monitors ("TP monitors") in use throughout Corporate America. Suppose that taking an order from a customer involves inserting records into a factory scheduling database in Indiana (Oracle), a billing database in Pennsylvania (old IBM mainframe with non-relational database), a management reporting system in New York (Informix), and a shipping scheduling system in California (Sybase). If any one of these systems is down or unreachable, you want all the other systems to be untouched. Yet you can't ask the Oracle system in Indiana to roll back changes made to the old IBM mainframe database in Pennsylvania.

As far as I could tell, the TP monitors worked kind of like the software described in the ecommerce chapter: before trying to do anything on a foreign system, it records what is about to try to do. Afterwards, if something bad happens on any of the foreign systems, it can use this record to undo what it did.

Not wishing to buy and install one of these systems, or read the white papers for BEA Tuxedo and M3 at http://www.beasys.com, I decided to take advantage of my privileged position as an academic computer scientist. I went to the world's leading authorities on transaction processing, folks who teach on this subject every semester and publish scholarly articles in refereed journals, and asked

"Can you suggest a general algorithm that, given an SQL insert statement, will produce another SQL insert that writes enough information into the database to undo the first insert? Is this how commercial TP monitors work?"

Their answers were instructive. As of 1998, none of the academic experts on transaction processing had considered this problem. None were aware of the existence of commercial TP monitors, much less how they functioned. The final thing that I learned is that if the ability to sit down at Oracle or SQL Server and type a single legal SQL statement were a condition of employment in university computer science departments there would be plenty of job openings for new PhDs.

Our ultimate solution to this problem? As the program is inserting a new member, stuff some crud into the database recording the insertion as part of Upload #3271:

create sequence an_csv_upload_sequence;

create table an_csv_uploads (
	upload_id	integer primary key,
	upload_time	date,
	-- who did the upload and from where 
	member_id	integer not null references an_members,
	originating_ip	varchar(50),
	-- for which domain
	domain		varchar(20) not null references an_domains,
	-- the original file name on local disk
	original_filename	varchar(300)
);

-- this table records which members were added during an upload

create table an_csv_upload_rollback (
	upload_id	integer not null references an_csv_uploads,	
	member_id	integer not null references an_members
);

What about the Tcl code that populates these tables? First, the file upload form:

<form method=POST enctype=multipart/form-data action=members-upload-2.tcl>
<input type=hidden name=domain value="$domain">
Pick a file: <input type=file name=csv_filename size=50><br>
<p>
<center>
<input type=submit value="Upload">
</center>
</form>

Note the use of the enctype specification in the form tag. This tells the user's browser to MIME-encode the file chosen with the input type=file widget. How do we process this MIME-encoded file? AOLserver decodes it automatically and puts it in a /tmp file on the server. Calling ns_queryget csv_filename.tmpfile will get the temporary filename, which can then be read using the standard Tcl language commands.

Here's the main loop of the upload processor:

set upload_id [database_to_tcl_string $db "select an_csv_upload_sequence.nextval from dual"]

ns_db dml $db "insert into an_csv_uploads (upload_id, upload_time, member_id, originating_ip, domain, original_filename)
values
($upload_id, sysdate, $member_id, '[DoubleApos [ns_conn peeraddr]]','$QQdomain','[DoubleApos [ns_queryget csv_filename]]')"

# let's grab the csv file

set stream [open [ns_queryget csv_filename.tmpfile] r]

set first_iteration_p 1

# we call the AOLserver API procedure ns_getcsv to read the next line 
# from the CSV file, decode it, and put the values into the Tcl list 
# one_line

while { [ns_getcsv $stream one_line] != -1 } {
    if { $first_iteration_p && ([lindex $one_line 0] == "foreign_key") } {
	ns_write "<li>skipping the first line because it looks like column headers\n"
	continue
    }
    set first_iteration_p 0
    # let's destructure the list first, using the magic 
    # procedure an_destructure_record_list that will set
    # a bunch of local variables
    an_destructure_record_list
    # now all the column vars are defined in our local env, 
    # including $QQ versions for use with the RDBMS
    ns_write "<li>Working on $first_names $last_name... \n"
    # look for existing members either by foreign_key or email address
    # (code removed for clarity)
     ...
    # if we got here, we didn't find a record with the same email
    # address or foreign_key
    set new_id [database_to_tcl_string $db "select member_id_sequence.nextval from dual"]
    # one of the few tricks below is the "string toupper" on the state code
    # we want to forgive folks who use "ma" or "Ma"
    if [catch { ns_db dml $db "begin transaction"
                ns_db dml $db "insert into an_members (member_id, email, ...)
values
($new_id,'$QQemail', ...)
                ns_db dml $db "insert into an_member_domain_map (member_id, domain, foreign_key, source, member_type)
values
($new_id, '$QQdomain', [ns_dbquotevalue $foreign_key text], [ns_dbquotevalue $source text], [ns_dbquotevalue $member_type text])"
                ns_db dml $db "insert into an_csv_upload_rollback (upload_id, member_id)
values 
($upload_id, $new_id)"
                ns_db dml $db "end transaction"
    } errmsg] {
	ns_db dml $db "abort transaction"
	ns_write ".. ouch!  Here was the error from Oracle:\n$errmsg\n"
    } else {
	ns_write ".. inserted"
    }
}

close $stream

At any time after an upload, a domain administrator can

review upload history
remove all the members added by a particular upload
remove some of the members added by a particular upload (using checkboxes)

User Authentication

A lake just above Rocky Gorge, off the Kancamagus Highway, New Hampshire

On a site like Action Network you want to be pretty sure that a guy who says he is Joe Smith, authorized maintainer of a domain, is actually Joe Smith. Although Rush Limbaugh is one of the great heroes of the Internet Age (he met his wife on-line), Environmental Defense would probably not want him using their server to send out messages to thousands of environmentalists.

One glorious day in the distant future standard computers will be augmented with authentication hardware that talks to a card or a key that users carry around. The physical device plus a PIN will then suffice to authenticate them all over the Internet, much as an ATM card plus PIN is used in our current banking system.

For now, we have passwords. Here's the simplest reasonable way to do authentication:

Ask the user for an email address and password combination.
Look up the person's user_id in the database (by the email address) and then check to see if the password matches (I like to do this case-insensitive).
If the password matches, write the password from the database plus the user ID into a browser cookie.
On subsequent page requests for URLs that require authentication, look for the user_id cookie and then query the database to see if the password matches.

Using a cookie saves the user from having to enter his password repeatedly. If you use HTTPS all the transmissions of cookies are encrypted so that the password/user_id combo is not subject to packet sniffing attacks. We have a potential performance problem in that we're querying the database on every page load. Oracle will have cached the query plan and the data but still it seems ridiculous to ask Oracle 100 times on 100 page loads for "user 456's password".

Although servers are probably more than powerful enough to handle the extra database queries, why not save the power for more interesting applications and/or those big days when a site gets into the news? A simple solution is to cache retrieved passwords in the server's virtual memory, in some data structure that is accessible to all the threads. When a request comes in, the server script checks the cache first. If there is no entry or if the passwords don't match, then it asks the database for the password. Note that this algorithm works even if a user changes his password. Here's some example code:

# we tell AOLserver to run our cookie checker procedure before
# serving any request for a URL that starts with "/pvtm" 
ns_register_filter preauth GET /pvtm/* an_verify_member_filter
ns_register_filter preauth POST /pvtm/* an_verify_member_filter

proc an_verify_member_filter {args why} {
    set member_id [an_verify_member] 
    if { $member_id == 0 } {
	ns_returnredirect /
	# returning "filter_return" causes AOLserver to abort
	# processing of this thread
	return filter_return
    } else {
	# got a member_id and the password matched
	# returning "filter_ok" causes AOLserver to proceed
	return filter_ok
    }
}

# return member_id if happy, 0 otherwise

proc an_verify_member {} {
    set headers [ns_conn headers]
    set cookie [ns_set get $headers Cookie]
    if { [regexp {an_member_id=([^;]+)} $cookie {} member_id] } {
	if { [regexp {an_password=([^;]+)} $cookie {} urlencoded_password] } {
	    # got member_id and password, let's check it 
	    # we urlencode/decode passwords so that users are free to
	    # put in semicolons and other special characters that might
	    # mess up a cookie header
	    set password [ns_urldecode $urlencoded_password]
	    # the ns_share API call instructs AOLserver to treat an_password_cache
	    # as a global variable; note that we're using it as a Tcl
	    # associative array
	    ns_share an_password_cache
	    if { [info exists an_password_cache($member_id)] && \
                 ([string compare $password $an_password_cache($member_id)] == 0) } {
		# we had a cached password and it matched
		return $member_id
	    } else {
                # we need to talk to Oracle
		set db [ns_db gethandle subquery]
		set an_password_cache($member_id) [database_to_tcl_string $db \
                        "select password from an_members where member_id = $member_id"]
                # we explicitly release the database connection in case
		# another filter or the thread itself needs to use one
		# from this pool
 		ns_db releasehandle $db
		if { [string compare $password $an_password_cache($member_id)] == 0 } {
		    # passwords match
		    return $member_id
		}
	    }
	}
    }
    # we didn't get a match or a member_id or something...
    return 0
}

Paul Holbrook, one of the guys who made www.cnn.com so fast, sensibly sent email asking "How would you prune such a cache? If you have a large number of users, such a cache could grow very large." Here was my response:

Paul, you will be impressed by my advanced cache pruning algorithm: I
have another AOLserver running that kills my primary server once/day! So
the cache builds up over only a 24-hour period.

Generally I architect all of my Web services so that AOLserver or Oracle
can be restarted at any time without even affecting a current user
"session". So in theory the server could be restarted every two hours or
whatever.  In practice, a server restart can affect a handful of users.
When AOLserver gets a kill signal, if it is in the middle of serving
files to users, it may stay alive for a configurable period of time
(default 60 seconds) to finish serving them.  So anyone connected by a
reasonably fast link shouldn't notice that the server was killed halfway
through serving his request.  However, during this shutdown period,
AOLserver does not respond to new requests.  Browsers generally time out
after 30 seconds so if a very busy AOLserver serving large files to at
least some modem users is killed, a few users will get a "server did not
respond" message and have to retry.

Let's talk about the cache size, though. Suppose that each userid +
password takes up 100 bytes. If I have one million users visiting during
my 24-hour period, the cache would only be 100 MB in size. Given that
AOLserver runs as a single process (i.e., these data aren't duplicated
in multiple processes, as they might be with Apache), this wouldn't be
an unsupportable load on my HP-UX box (with its 4 GB of RAM).

In practice, my pathetic site is only visited by at most 25,000 people
per day :-( And probably only half would be users who've registered in
some way (e.g., by posting a Q&A forum message). So my daily cache would
never be more than about 1 MB.

For higher security over unencrypted connections, it would probably be better to issue encrypted tickets that are good for a limited period of time and only from a particular IP address.

Summary

A misty morning a little bit north of Woodstock, Vermont

As Philip Elmer-Dewitt wrote in TIME Magazine, "The Internet is more than a place to find pictures of people having sex with dogs." The Internet could also be more than a place to buy exactly what we want. For those non-profit organizations that have the will, the Internet can be a tool for making our world a better place.

Wallace Stevens's "Thirteen Ways of Looking at a Blackbird" is available from http://www.poets.org/poems/poems.cfm?prmID=1724
aside from Jin S. Choi and myself, the folks who built scorecard.org are all at www.getactive.com.
A Civil Action (Harr 1995; Random House) documents the litigation by leukemia victims in Woburn, Massachusetts against W.R. Grace and Beatrice Foods, whose factories released trichloroethylene. The book on tape version is very compelling and requires an investment of only a couple of hours.
A Pattern Language (Alexander 1977; Oxford University Press).
Hill Rat (Jackley 1992; Regnery Gateway) chronicles a few years in the life of a Congressional staffer; you won't know whether to laugh or cry.

or move on to Chapter 17: A Future So Bright You'll Need to Wear Sunglasses

philg@mit.edu