Chapter 14: Sites That Don't Work (And How to Fix Them)
by Philip Greenspun, part of Database-backed Web Sites
Note: this chapter really suffers on the Web. Macmillan did some lovely napkin drawings that you will only be able to see if you buy a copy on real dead trees. I hope to some day convert these for the Web but I'm too busy at present.
This chapter looks at the evolution of the systems that I've used for three years to manage user feedback and related links. It demonstrates that solutions appropriate to a new site with 10,000 hits per day are time-consuming failures when applied to a mature site with 500,000 hits per day. I'm going to share with you a bunch of site and software design ideas that eventually failed, study why they failed, and then share my ultimate solution.
I peppered my first Web projects with user feedback forms: "Click here to tell me how you liked Travels with Samantha." From this, I learned two things:
I reduced the severity of the first problem by changing to a "mailto" tag instead of a form. After I made the change, I would only get bounces when replying to users who hadn't correctly filled out their browser's options dialog box. I reduced the severity of the second problem (sheer volume) by discreetly burying my e-mail address at the bottom of pages and refraining from explicitly requesting feedback. Users with a really heartfelt response to something would generally take the initiative and send me long thoughtful messages. Users who'd formerly just sent a line or two didn't bother. Every now and then I'd mine the most thoughtful comments from my e-mail archives and add them to http://webtravel.org/samantha/other-voices/.
So much for comments. That still left me with 10 or 20 questions every day. These fell into the following categories:
1. I scanned some pictures for my Web site and they look terrible. Yours look much better and they are one-third the size. How did you do it?
2. I was thinking about buying Camera X for purpose Y and want to know if you think this is a good idea.
3. My routine job is numbing my brain. I want to have a fabulous varied life full of travel and new experiences like yours. How do I do it?
Throughout this chapter I'm going to refer to my personal site (http://photo.net/philg). I thought it would be helpful background to present five graphical snapshots of the site taken at various times from 1994 through 1997 (Figures 14-1 to 14-5).
Figure 14-1: My pathetically wimpy Web site in early 1994. I offered a bunch of research papers that I'd written as big PostScript files. Most of the traffic was for two travelogues with photos and an exhibit of zoo photos, Heather Has Two Mommies. The latter was my attempt to write a charming children's book. After about one hour, though, I discovered that I was missing something fundamental: talent. So I changed the title to the funniest children's book title I could find and turned it into an uninspired parody of political correctness. It all happened between midnight and 6 am on one strange night. To this day I get e-mail from folks saying "I listen to Rush Limbaugh every day and I'm so happy to have found a kindred soul on the Internet." I'm half-tempted to delete it but it gets at least 5,000 hits/day.
Figure 14-2: After Travels with Samantha won Best of the Web '94, I decided to respond to Internet demand for more travelogues with an extensive New Zealand story and a short piece on the Cayman Islands. Meanwhile, I added a photography section in hopes of more efficiently dealing with e-mailed questions (see text). I also started a narcissism section containing my resume and other personal information to answer another category of e-mailed questions.
Figure 14-3: By mid-1995, I had decided to experiment a bit with my travelogues, making them more useful as travel planning tools. I struck a deal with Moon Publications to reprint portions of their guidebooks to Costa Rica and New Zealand. I arranged with some experts on the ground in Costa Rica to answer questions. I also got serious about offering Web publishing advice and started the Web Tools Review page that grew into this book.
Figure 14-4: Toward the end of 1995, I got bored with adding more static pages. I started to use my personal site as a proving ground for Web collaboration ideas. I added a classified ad system, threaded discussion/Q&A forums, a registry for stolen cameras, and a neighbor-to-neighbor service for people to record their experiences with camera retailers, wedding photographers, individuals selling equipment on the Internet, and so forth.
Figure 14-5: My personal site as of March 1997. The interesting trends to note are that I've added collaboration to virtually every page on my server. Users can add comments and/or related links to my pages. I'm running the classified ad software for Help Wanted web developer ads in Web Tools Review, for travel services in Web Travel Review, and for cameras in photo.net. I've added an Ask Philip Q&A forum under Narcissism. I'm offering all of these collaboration tools free to other Web publishers as services running from my server at Primehost. I'm publishing more content from other authors, ranging from travelogues to camera reviews. I've also rediscovered the joy of writing static pages and have added new travel sites of my own (Italy and California) plus an entire materialism section reflecting my experience as a consumer.
I disposed of Category 1 questions and a few other very specific ones by writing a FAQ for Samantha and linking from there to an explanation of my process for batch-converting images from Kodak PhotoCD (http://photo.net/philg/how-to-scan-photos.html, which I eventually replaced with Chapter 4 of this book). That cut down the number of people asking by a factor of seven or eight. The remaining "how to scan" questions that did come through were easier to answer after I augmented my Emacs a little bit so that a few keystrokes add the appropriate URL to an e-mail reply.
So satisfied was I with this situation that I attempted to deal with the Category 2 questions in the same manner. I built a photography page (http://photo.net/photo/) and added material to it every time someone asked a question. Eventually, I figured, people would answer their own questions by browsing my page and the photo question stream would trickle out. Instead what happened is that my content spurred more questions. If I said that I liked my Yashica T4 point-and-shoot camera, that spurred literally hundreds of people to ask, "How does it compare to the Pentax/Olympus/Minolta YadaYada-15X point-and-shoot camera?" If I said that you needed a large aperture lens to do portraiture, that spurred a moderate number of questions about what I meant by "aperture." Here's one that came in a few days ago:
"I recently inherited a Ciro-flex 120 box camera (model F). It says made in Deleware, OH, it has a Wollensak lens, 83mm f 3.2. I have taken a few rolls of Kodak color prints which seem to be OK, but I am not experienced enough with 120 film to know what I am doing at this point. Do you know anything about this camera, is it good quality that I can use for outdoor photography or should I keep my 35mm point-and-shoot?"
The interesting thing about this question is that the guy asking it is probably the only person on the Internet, perhaps the only person in the world, capable of answering it. He has an apparently functioning example of an obscure camera that hasn't been manufactured for at least 50 years. He has taken "a few" rolls of pictures with the camera. He presumably has compared these pictures to those from his modern 35mm camera. I'm never going to get another question like this and, even if I did get a regular stream of questions about "Ciro-flex" cameras, I lack the experience to answer them myself.
My next attempt to deal with this flood was my Q&A forum software. I would use this to dispense photo advice too fragmented to put into .html files in photo.net. As with an .html file, though, I'd only have to answer each question once. Users on the point of asking a question would instead find the answer already posted in the forum. Furthermore, I'd capture wisdom and experience from other readers (maybe even some who'd used a Ciro-flex camera!).
At first, the forum was a tremendous success. I was surprised at the depth of knowledge of many of my readers and their eagerness to help novices. I no longer had to explain what an aperture was. Remember that my forum software directly e-mails responses to the original poster. It uses the ns_sendmail AOLserver API call and fakes headers so that if the novice replies with a follow-up, it goes to the responding reader. They can have a lengthy e-mail exchange without me even seeing it.
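The header trick can be sketched in a few lines of Python. This is only an illustration, not the forum's Tcl code: the text says the forum uses ns_sendmail and "fakes headers" without naming them, so setting From and Reply-To to the responder's address is an assumption, as are the function name and the example addresses.

```python
from email.message import EmailMessage


def build_answer_notification(asker_addr, responder_addr, subject, body):
    """Build the e-mail that tells the asker a response was posted.

    The forum server generates the message, but From and Reply-To carry
    the responder's address (an assumed choice of headers), so a
    follow-up from the asker goes straight to the responder and never
    touches the forum maintainer's inbox.
    """
    msg = EmailMessage()
    msg["To"] = asker_addr
    msg["From"] = responder_addr
    msg["Reply-To"] = responder_addr
    msg["Subject"] = "Response to: " + subject
    msg.set_content(body)
    return msg


# Hypothetical addresses, for illustration only.
msg = build_answer_notification(
    "novice@example.com",
    "expert@example.org",
    "What is an aperture?",
    "It is the opening in the lens that admits light.",
)
```

Handing the message to a mail server (for example with smtplib) is left out; the point is only where a reply to the notification would be routed.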
Should one of my eager answerers say something that is inaccurate, I can just add my own answer. An answer from email@example.com sorts right at the top (since I'm the forum maintainer). So I can leave a partly correct answer below if it has any pedagogical value. Alternatively, I can just delete erroneous answers from the admin interface.
Everyone was winning. I was answering more distinct questions. My sophisticated readers were directly answering my naive readers. A lot of folks were checking the archives before posting duplicate questions.
After collecting about 2,000 postings (both questions and answers), the Q&A forum seemed to be failing. A lot of users were covering the same ground with "What kind of camera should I buy?" questions. At first, I dealt with this by adding a generic "Please search this bboard before asking a question" form. This helped a bit, but in the end I added a column to the bboard_topics table with a customizable "pre_post_caveat." See Figure 14.6 for what the "Post New" form looks like now.
Figure 14.6: After collecting about 2,000 postings, my http://photo.net/photo/ Q&A forum began to fall apart. Users were posting new questions that had already been asked and answered. At first, I added the generic "search this forum first" note. Later, I decided that I needed the ability to have a custom pre_post_caveat for each bboard topic. This figure shows the user interface in place as of March 1997.
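A per-topic caveat of this kind might look like the following sketch. It uses Python with an in-memory SQLite database rather than the AOLserver/Illustra setup described in this book; only the table name bboard_topics and the column name pre_post_caveat come from the text. The topic column, the caveat wording, and the fallback to the generic note are assumptions made for illustration.

```python
import sqlite3

GENERIC_CAVEAT = "Please search this bboard before asking a question."

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real table surely has more columns.
conn.execute("CREATE TABLE bboard_topics (topic TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO bboard_topics (topic) VALUES (?)",
    [("photo.net",), ("Ask Philip",)],
)

# The schema change described above: one new, nullable column.
conn.execute("ALTER TABLE bboard_topics ADD COLUMN pre_post_caveat TEXT")
conn.execute(
    "UPDATE bboard_topics SET pre_post_caveat = ? WHERE topic = ?",
    ("Before asking which camera to buy, please search the archives.",
     "photo.net"),
)


def caveat_for(conn, topic):
    """Return the topic's custom caveat, or the generic one if none is set."""
    row = conn.execute(
        "SELECT pre_post_caveat FROM bboard_topics WHERE topic = ?",
        (topic,),
    ).fetchone()
    if row is None or row[0] is None:
        return GENERIC_CAVEAT
    return row[0]
```

The "Post New" page would call something like caveat_for before rendering the form, so each forum can warn about its own most common duplicate questions.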
Am I done? I don't think so. There are 478 top-level postings in the forum. Even with the categorization, that's becoming far too many for users to browse. Furthermore, I haven't been savage enough in my editorial judgment. There are a lot of threads that might be worth keeping around for the full-text search but that aren't worth presenting at top level. For example, one reader asks whether a Tamron 90/2.8 macro lens is any good. Another reader responds that he likes his older Tamron 90/2.5 lens, which is a different optical design. That's the whole thread. I'd hate for someone to stumble into it because it doesn't really teach anything. On the other hand, I don't want to delete it because it might be useful to someone in the market for a used Tamron 90/2.5 lens.
I think it is time to add a column to my bboard table. It will be called interesting_p and will be "t" if the thread is interesting, "f" if not. Then I can reprogram my top-level page to only show the interesting threads. The full-text search service will still consider all the postings in the table. So the average reader won't be distracted by this posting whereas the searcher for "Tamron macro" will find it instantly.
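Here is one way the interesting_p idea could be sketched, again in Python with SQLite instead of the production RDBMS. The bboard table name and the interesting_p column with its "t"/"f" values come from the text; the other column names are hypothetical, and a plain LIKE query stands in for the full-text search service.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns; only the table name "bboard" is from the text.
conn.execute(
    "CREATE TABLE bboard ("
    "msg_id INTEGER PRIMARY KEY, one_line TEXT, message TEXT)"
)
conn.executemany(
    "INSERT INTO bboard (one_line, message) VALUES (?, ?)",
    [
        ("Is the Tamron 90/2.8 macro any good?",
         "I like my older Tamron 90/2.5."),
        ("What is an aperture?",
         "The opening in the lens that admits light."),
    ],
)

# Existing threads default to interesting; an editor demotes the dull ones.
conn.execute(
    "ALTER TABLE bboard ADD COLUMN interesting_p TEXT NOT NULL DEFAULT 't'"
)
conn.execute(
    "UPDATE bboard SET interesting_p = 'f' WHERE one_line LIKE '%Tamron%'"
)


def top_level(conn):
    """Threads shown on the top-level page: interesting ones only."""
    rows = conn.execute(
        "SELECT one_line FROM bboard "
        "WHERE interesting_p = 't' ORDER BY msg_id"
    )
    return [r[0] for r in rows]


def search(conn, term):
    """Search every posting, whether or not it is marked interesting."""
    pattern = "%" + term + "%"
    rows = conn.execute(
        "SELECT one_line FROM bboard "
        "WHERE one_line LIKE ? OR message LIKE ? ORDER BY msg_id",
        (pattern, pattern),
    )
    return [r[0] for r in rows]
```

The casual browser sees only the aperture thread, while a search for "Tamron" still turns up the demoted one.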
Who is going to decide whether a thread is interesting? I could do it, but I'm already spending 30 minutes a week maintaining the forum (mostly deleting uninteresting postings and categorizing miscategorized threads). I think I will build in a facility to allow the group to collaboratively decide that a posting is uninteresting and/or allow a group of experts selected by the maintainer to decide.
Note: In addition to starting photo.net to handle the photography questions, I started Web Tools Review (http://webtools.com/wtr/) to handle the Web publishing questions. It worked pretty well for a while but then the information started to get out of date and the page became an embarrassment. Then Simon Hayes at Ziff-Davis Press approached me with the idea of revamping the material and turning it into a book. You're reading it now. So what started out as a way to save time answering reader e-mail turned into the job of writing 450 manuscript pages of text plus figures and screen shots. How much fun is that? Winston Churchill had something to say on this subject: "Writing a book is an adventure. To begin with, it is a toy and an amusement; then it becomes a mistress, and then it becomes a master, and then a tyrant. The last phase is that just as you are about to be reconciled to your servitude, you kill the monster, and fling him out to the public." Anyway, I guess I have to say that the idea of building Web Tools Review as a time-saver has failed miserably.
These are the questions that break my heart. Some poor guy trapped at a desk job 40 hours a week. He has read all 19 chapters of Travels with Samantha and my 60-page New Zealand story and my 50-page Costa Rica story. Now he wants to bust out and be a footloose free spirit like me.
I really ought to program my Emacs to autorespond with "All you have to do to be a free spirit like me is develop 50 RDBMS-backed Web sites. Then you can be woken up in the morning by a former customer whose full-time programmer has quit and left them with a bug; they want you to telnet into their Unix box and poke around and fix it. Then you can spend 120 hours a week at your desk maintaining various Web servers. Then you can go away for the weekend and find 250 e-mail messages from your readers in your inbox."
My big objective through all of the preceding endeavors was ducking e-mail. This would have been fine if I still had an interesting life to write about. Then I could have enhanced my site periodically with new stories. Unfortunately, as I noted above, I'd taken to spending entire weeks at home in front of my terminal. I'd become so boring that I started to quote Andrei Gromyko. (When he was Soviet ambassador to the U.S. and a reporter asked him a question about his family, he'd say "My personal life doesn't interest me.")
I decided to turn to my readers. They would invigorate my site by adding interesting comments to each of my pages. I could use a relational database to collect and organize these comments, then some simple programs to present them. But my personal server was on an old HP-UX machine at the MIT Laboratory for Computer Science. This computer was already struggling to handle my static pages. It certainly couldn't handle running an RDBMS that got queried on every page load. A script-and-RDBMS-backed site is never as reliable as a static site. I didn't want to reduce the reliability of my large body of static material in order to serve a small body of dynamic material. I didn't want to spend the rest of my life doing system and database administration either.
I ended up building a comment server on a separate computer, a multi-processor SPARCserver 1000 already running Illustra and AOLserver to reliably deliver a million RDBMS-backed hits per day. Conveniently, this box had an on-call system/database administrator paid for by the commercial Web publisher who owned it. What would have been a lifetime project if done at MIT could now be accomplished with a few days of programming.
How does my Loquacious system end up working? Suppose I have a static page such as http://photo.net/photo/point-and-shoot.html. I simply add a reference to the comments page at http://db.photo.net/com/philg/photo/point-and-shoot.html.
The "/com" tells the AOLserver listening on db.photo.net that the user is requesting the Loquacious comment service (as opposed to the other services delivered by this host). The "/philg" tells the Loquacious software that the referencing page is part of the "philg" realm (previously defined and stored in a comment_realms table). The last portion of the comment reference "/photo/point-and-shoot.html" is a URL stub that, if glued to the realm base, will form the complete URL of the referencing page. For example, in the case of the "philg" realm, the URL base ("server_prefix") in the comment_realms table is "http://photo.net/". The first time a comment page is referenced, the Loquacious system attempts to grab the static file and REGEXP out the title. That way it can provide a backlink that says "Comments on Point & Shoot Cameras" rather than "Comments on http://photo.net/photo/point-and-shoot.html" (see Figure 14.7).
Figure 14.7: http://db.photo.net/com/philg/photo/point-and-shoot.html, the database-generated dynamic comment page for the static page http://photo.net/photo/point-and-shoot.html. Note that the backlink to "Point & Shoot Cameras" was automatically generated by the comment server. The first time a user requested the comment page from Loquacious, it fetched the static page from my static server and REGEXP'd out the title. That's why you don't see "comments on http://photo.net/photo/point-and-shoot.html".
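The URL decomposition and the title extraction can be sketched as follows. This is an illustrative Python version, not the Loquacious Tcl code: the dictionary stands in for the comment_realms table, the function names are made up, and the regular expression is one plausible way to pull out a title rather than the pattern actually used. Fetching the static page over HTTP is omitted; extract_title takes the page text as input.

```python
import re
from html import unescape

# Stand-in for the comment_realms table: realm name -> server_prefix.
COMMENT_REALMS = {"philg": "http://photo.net/"}


def referencing_url(comment_path):
    """Map '/com/<realm>/<URL stub>' to the full URL of the static page."""
    match = re.fullmatch(r"/com/([^/]+)/(.+)", comment_path)
    if match is None:
        raise ValueError("not a comment-service path: " + comment_path)
    realm, stub = match.groups()
    # Glue the stub onto the realm's base URL.
    return COMMENT_REALMS[realm] + stub


def extract_title(html_text):
    """Return the page's <title> text, or None if there is no title."""
    match = re.search(
        r"<title[^>]*>(.*?)</title>", html_text, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        return None
    # Collapse whitespace and decode entities such as &amp;.
    return unescape(" ".join(match.group(1).split()))
```

When extract_title returns None, the comment page could fall back to showing the bare URL in its backlink.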
As soon as I'd finished building the AOLserver Tcl/SQL scripts that implement the Loquacious system, I wrote a Perl script to grind over my 1,000 or so static .html files at MIT. The Perl script inserted a legal comment server reference at the bottom of each page. Now virtually all of my pages could accept comments.
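A batch job of this sort might be sketched like this, in Python rather than Perl. The text does not show what the inserted reference looked like, so the link text, its placement just before the closing BODY tag, and the function names are all assumptions; the URL scheme follows the db.photo.net example given earlier in this chapter.

```python
import re


def comment_url(page_path, realm="philg", comment_host="http://db.photo.net"):
    """Build the comment-server URL for a static page's path."""
    return f"{comment_host}/com/{realm}/{page_path.lstrip('/')}"


def add_comment_link(html_text, page_path):
    """Return html_text with a comment link added near the bottom.

    Pages that already contain the link are returned unchanged, so the
    job can safely be re-run over the whole site.
    """
    url = comment_url(page_path)
    if url in html_text:
        return html_text
    link = f'<p><a href="{url}">Add a comment</a></p>\n'
    match = re.search(r"</body>", html_text, re.IGNORECASE)
    if match is None:
        # No closing BODY tag (common in hand-written pages): append.
        return html_text + "\n" + link
    return html_text[:match.start()] + link + html_text[match.start():]
```

A driver would walk the document root, read each .html file, and write back the result of add_comment_link with the file's path relative to that root.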
Where did this flurry of programming get me? Right back to where I was in 1994. A torrent of comments flooded in, mostly one or two sentences of the "I love this page" ilk. These cluttered up my inbox just as in 1994 because my software e-mails the realm maintainer every time there is a new posting. But now I had to also go to the back-end admin pages and delete these observations from my database.
In an attempt to stop the ratings, I put in cautionary language saying, "This is not meant to collect ratings but only alternative perspectives that might be of interest to other readers." The flood continued unabated.
My next step was to add a second submit button, offering users the option of sending private e-mail to the realm maintainer (see Figure 14.8). This cut down on the amount of crud dumped into the RDBMS but increased my junk e-mail load. A lot of users (at least 30 percent) would submit a public comment, look at the confirmation page that said, "Thanks and, by the way, I've just sent e-mail to firstname.lastname@example.org," then back up and resubmit the same comment as "private e-mail to email@example.com" (Figures 14.9 and 14.10). My defense against garbage in the database remained a couple of very fast back-end administration pages (Figures 14.11 and 14.12).
Figure 14.8: My second attempt at an "add my comment" form; I've added the option of sending private e-mail to the comment realm maintainer (in this case "firstname.lastname@example.org") instead of adding something to the persistent database.
Figure 14.9: Confirmation page from Loquacious following addition of a public comment. Note that the user is explicitly informed that the comment realm maintainer (in this case "email@example.com") has been notified via e-mail. Nonetheless, at least 30 percent of commenters backed up and resubmitted the same comment as a private e-mail message!
Figure 14.10: A typical e-mail notification of a new comment received. Note that this was a public comment, rather than a private message, and it is a question rather than the informed alternative perspective that I seek. When reading this message in Netscape Navigator, I am just one click away from the administration page (see Figure 14.11).
Figure 14.11: The administration back-end for comments on just one page. This is not a very convenient way to edit five recently received and bogus comments, though, which is why I built the quicker interface in Figure 14.12.
Figure 14.12: With this super administration page, I can go through a whole month's worth of comments, deleting duplicates, questions, and otherwise uninteresting material.
At this point, I stepped back from the system and let it sit for a few months. I had accumulated several hundred interesting comments and therefore could not account Loquacious a failure. On the other hand, I was spending up to 30 minutes each week deleting redundant e-mail and cleaning the database. People wanted to rate my content even though I told them that I wasn't interested in positive comments. Oftentimes, people would say, "This page is great. I want to encourage you to keep it available." I think the average person's experience of the Web is that links go dead within a few months. They presumably assume that the reason for the short-livedness of links is that authors aren't getting enough encouragement and therefore fold up their server tents and go home.
My friend Neil suggested, "Hey, if people are determined to rate your content, why don't you add a system whereby they can rate your content?" So I added a table to my RDBMS and wrote a couple of new Tcl procedures; you can see the result in Figures 14.13 and 14.14. I don't have enough experience with the new system to say whether my feedback problems are finally solved, but I'm optimistic.
Figure 14.13: I replaced the "Add Comment" form with a gateway page, encouraging users to think for a moment about what they want to express. Note that the very first option is simply to add an integer rating, something that won't result in me getting an e-mail message.
Figure 14.14: The new "Add Rating" form. Note that I collect user e-mail addresses and full names. These are optional fields but might be useful if I change the commented-on page and want to spam everyone who has ever rated it. For example, "You rated foobar.html a 9 on March 26, 1997. I've made a bunch of changes and you'll probably want to come back and re-rate it as a 2."
Note: If you want to run a Loquacious comment realm for your own static site, you can just add a realm to my database server. Just visit http://webtools.com/wtr/ and fill out a form. It will cost you nothing. I maintain the comment server and RDBMS.
When the Web was young, publishers linked to everyone they could find. You linked to people whose content complemented yours. You linked words in your documents to foreign servers with deeper explanations. You built favorite links pages to highlight work that you admired. You did all of this in an attempt to create a richer hypertext environment for your readers.
That worked pretty well when 95 percent of the publishers were students and researchers and there wasn't a get-rich-quick-with-banner-ads stampede.
Nowadays, if you operate a popular Web site, you'll get at least 20 messages per day requesting a link exchange. This is kind of a strange concept if you are operating under the original Web model. You've linked to publishers whose content complements your own and will therefore help your readers. Should you pull those links if the linked-to publishers don't like your site? Suppose I publish a page about Minolta cameras (http://photo.net/photo/minolta/index.html) in which I link to http://www.minoltausa.com/. My page notes that I think almost everyone would be better off buying a Canon or Nikon 35mm single-lens reflex system rather than a Minolta Maxxum. Should I expect Minolta to link back to me? Should I pull my link to them if they won't?
Unable to look at all of the URLs offered, I programmed Emacs to autorespond something to the effect of "I don't try to maintain a comprehensive list of links to the rest of the Internet, even in one subject area. Even if I did, what would be the utility of this to my readers? They have Yahoo. They have AltaVista. Why do they need my half-heartedly maintained list?"
Then one day I had a brilliant insight: The people who started Yahoo are a lot richer than I am. Given that it would take me about one hour to program a self-maintaining Yahoo-style links directory and that there were apparently thousands of people willing to go to the effort of adding entries, I built the BooHoo system.
Here were my design objectives:
The BooHoo system works its way through the links every night or two. If a link is unreachable, its status goes from "live" to "coma." If it had already been marked "coma" by a previous sweep, it is marked "dead" and an e-mail notification is sent to the person who posted it. Dead links are no longer displayed to users. When the sweep gets to a link that is already marked "dead," it is either restored to "live" status (if the server has come back) or removed from the database (actually there is an administrative command to "really remove" the dead links and/or restore them all to "live" status; this ensures that all of your links are not lost if my server or the entire Internet has a few "bad hair days").
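The live/coma/dead life cycle is a small state machine, sketched here in Python. This is an illustration rather than the BooHoo code: the function names are invented, the reachability check is passed in as a function so no network access is needed, and moving a "coma" link back to "live" when it answers again is an assumption the text does not spell out. Actual removal of dead links is left to the administrative command mentioned above.

```python
def next_status(status, reachable):
    """Return (new_status, notify_poster) for one link after one sweep."""
    if reachable:
        # A dead link whose server has come back is restored to live.
        # Treating a recovered "coma" link the same way is an assumption.
        return "live", False
    if status == "live":
        return "coma", False   # first failed sweep: give it another chance
    if status == "coma":
        return "dead", True    # second failed sweep: hide it, e-mail poster
    return "dead", False       # already dead: wait for the administrator


def sweep(links, is_reachable):
    """Update a {url: status} dict in place.

    Returns the URLs whose posters should be e-mailed because the link
    has just been marked dead.
    """
    to_notify = []
    for url, status in links.items():
        new_status, notify = next_status(status, is_reachable(url))
        links[url] = new_status
        if notify:
            to_notify.append(url)
    return to_notify


def visible(links):
    """Links shown to readers: everything except the dead ones."""
    return sorted(url for url, status in links.items() if status != "dead")
```

Requiring two consecutive failures before hiding a link is what keeps a single bad night on the remote server from wiping out a user's contribution.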
I get instant updates for free since BooHoo is fully RDBMS-backed. After a user finishes working through the "Add URL" forms, the data gets stuffed into a database table. The next time a reader requests the related links for that particular page, the new row is pulled from the database.
Why doesn't Yahoo work this way? Probably partly because of history. Yahoo was started by a couple of Stanford graduate students in electrical engineering. Universities are generally a couple of decades behind the times when it comes to database management technology.
Even if Yahoo were engineered with the latest and greatest RDBMS software, it would require a tremendously huge server farm to handle the hundreds of queries per second that an "Internet anchor" service gets. It is vastly cheaper in terms of hardware requirements to periodically grind the data out of the database into static .html files, which is what Yahoo in fact does.
Suppose that Joe Artiste adds a link from photo.net to his site. Unfortunately Joe is hosting on a slow server, has neglected to add WIDTH and HEIGHT tags to his IMGs (see Chapter 4), has chosen a truly offensive background GIF, and offers only the shallowest content. BooHoo instantly notifies me (the owner of the photo.net page) by e-mail when a related link is added. The e-mail notification includes an instant link to the administration page where the link to Joe Artiste may be removed with a mouse click.
More interestingly, suppose that Bill Gates is very enthusiastic about launching Windows 98. He thinks everyone should know about it and therefore believes that his page announcing Windows 98 is related to just about all of my pages. So he has some of his serfs make http://www.microsoft.com/win98/ a related link to all of my pages. BooHoo's blacklisting capability saves me from having to read a flurry of e-mail and then make a flurry of removal mouse clicks. As the administrator of a BooHoo realm, I can say "reject any link that matches the pattern *microsoft.com*". I can specify this pattern for just one page or for all pages known to a BooHoo installation. My software will grind over the database and delete all the links that contain "microsoft.com". Furthermore, any fresh attempts to insert related URLs containing "microsoft.com" will be rejected with an error message to the user.
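The two halves of blacklisting, purging existing links and rejecting new ones, can be sketched in Python as below. The text gives only the pattern "*microsoft.com*", so treating it as a shell-style glob (via the standard fnmatch module), ignoring case, and the function names are all assumptions; the real system works against database rows rather than a Python list.

```python
from fnmatch import fnmatchcase


def is_blacklisted(url, patterns):
    """True if the URL matches any blacklist pattern, ignoring case."""
    lowered = url.lower()
    return any(fnmatchcase(lowered, pattern.lower()) for pattern in patterns)


def apply_blacklist(links, patterns):
    """Return the links that survive a purge of already-stored entries."""
    return [url for url in links if not is_blacklisted(url, patterns)]


def add_link(links, url, patterns):
    """Append a related link, or refuse it if it matches the blacklist."""
    if is_blacklisted(url, patterns):
        raise ValueError("links matching the blacklist are rejected: " + url)
    links.append(url)
```

In a Web setting the ValueError would become the error page shown to the person trying to add the link.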
Like everything else I've written for my personal site, I want BooHoo to be usable by other Web publishers. Consequently, I have a BooHoo system that is designed to run on one server (see http://webtools.com/wtr/) and treat each page as a separately managed item. This means that some nice features, for example "blacklist microsoft.com site-wide", aren't available. On the other hand, it saves publishers from the pain of having to install and maintain an RDBMS.
In terms of cutting down on my e-mail, BooHoo has worked fantastically well. I don't get as many requests for reciprocal links. When I do, I can just autorespond with a description of the BooHoo system. The system has collected hundreds of links. Figure 14.15 shows some of the links users have added to my top-level photo.net page. Figure 14.16 shows the form they use to add a link.
Figure 14.15: The BooHoo system displays related links to http://photo.net/photo/.
Figure 14.16: Anyone can add a link from photo.net to their own site simply by filling out this form.
Why don't I consider BooHoo a success? Mostly because I have been too lazy to add Related Links buttons to my static pages. Every time I want to add a new page to BooHoo, I have to fill out a Web form. I am too lazy to do this for the hundreds of static pages on my site that deserve related links buttons. I really ought to rewrite BooHoo to function more like Loquacious so that I need only write one Perl script to add Related Links buttons to all of my static pages in one fell swoop. But I've been too lazy to do that . . .
Nobody is smart enough to predict all of the implications of a software design decision. The first joy of developing Web software is that you find out immediately when you've made a mistake. The second joy is that you never distribute a CD-ROM to thousands of people. Thus you only have to fix your code on the server and all of your users will benefit instantly.
If God had meant you to get it right the first time, He would not have put "alter table" into SQL. If fixing bugs and adding features to online systems handling 20 hits a second were easy, you would not be getting paid $1,250 a day.
Note: If you like this book you can move on to Chapter 15.