Chapter 11: Sites that are really databasesby Philip Greenspun, part of Philip and Alex's Guide to Web Publishing
Revised July 2003
Web applications that replace desktop applications, such as the WimpyPoint system described in the first chapter, are natural candidates for database management software. What about public sites? Let's revisit the four categories presented in Chapter 2:
A Category 1 site, providing traditional information on the Web, doesn't usually start out as a database except in a degenerate way. However when all of the collaboration and personalization systems that most publishers eventually want are added, a growing percentage of the site will look like a database.
Categories 2 through 4 are databases disguised as Web servers. Information is collected in small structured chunks from a variety of simultaneous sources. It is distributed in response to queries that cannot be anticipated when the data are stored. If your site falls into categories 2 through 4 above, you are running a database even if you can't spell "SQL".
As long as one is running a database, one might as well do it right. Building a database-backed Web site requires a hard part that is heavy on the thinking and an easy part that is heavy on the implementation. As an example, let's suppose that our publishing idea is to build a mailing list system for an otherwise static Web site. We want users to be able to add and remove themselves. We need their e-mail addresses and real names. We want only the publisher to be the only person who can view the list and also the only one who can send mail to the list.
This is the step at which you decide whether or not you are going to allow two entries with the same e-mail address, whether you're going to assign some kind of guaranteed-unique key to each entry, and whether to also store the time and date of the entry. Received wisdom in the database world is that every table should have a synthetically generated key yet in this case it is hard to understand what point would be served in having two entries with the same e-mail address. Thus it might be acceptable to use e-mail address as the key.
That's it. You've done the hard part.
Note that it is only during the easy part that you have to think about the particular technologies you are using on your server. The user has no way of knowing whether you are running the latest release of Oracle or whether all the form submissions actually turn into e-mail to someone who manually updates a regular file. The site will still work the same as far as the user is concerned.
In practice, you will probably be using some kind of relational database (see the next chapter). So your programs will be turning HTML form submissions into SQL and results from the database back into HTML. It is important to choose a reliable database management system and reasonable Web connectivity tools for that system, but your site will not live or die by the choice. If you make the wrong choices, you will work harder at writing your programs and might have to buy a computer ten times larger than an efficiently-constructed site would need. But whether a site is worth using or not depends mostly on the data model, transactions, and form design.
The server logs of these MBAs would be a lot fatter if graphic design were the same thing as user interface design. You are building a program. The user interface happens to be Web pages. The links among the pages are part of the user interface. Whoever puts in those links has to understand whether the server can actually answer the query and, if so, how much crunching will be required.
Here's my theory on how to develop a database-backed Web site:
"We like the look."
Then I explained that full text search was slow. The database could find
all the ads in a category in about 1/20th of a second, but full text
search chewed up close to a full second of time on a $120,000
computer. And, by the way, we'd have to do the query twice. Once to fill
the space between
<NOFRAMES> tags for the Netscape 1.1 crowd. And
then again for the subwindows for users with frames-capable
browsers. The same expensive query, twice!
"We like frames."
If you can buy a $500 or $5,000 or $50,000 product, why bang your head against the wall trying to get through the next two chapters? Because there is no magic bullet. All the glossy brochures and PR budgets in the world can't disguise the fact that nobody has solved the automatic programming problem. There is no program that can take an English description of what you want done and turn that into software.
The fact that technology doesn't work is no bar to success in the marketplace, though, and tools that purport to automate programming have been selling well for decades. The worst of these tools simply don't work. You'd have been better off using EDSAC machine code. (The Electronic Delay Storage Automatic Computer was a vacuum-tube machine built in 1949, the first computer in which the program instructions were stored in memory along with the data.) The best of these tools are sort of like Lotus 1-2-3.
If your problem is very simple, it is much easier to code it up in a spreadsheet than it is to write a program from scratch. The spreadsheet generates the user interface. You just enter some rules for how cells relate to each other; the spreadsheet figures out in which order to perform the computations.
After a while, though, you become dissatisfied with your program. It looks and feels, well, like a spreadsheet. Rather than ask the user to just look at the row headings and type the data into the correct cells, you'd like a little series of interview dialog boxes. When the user gets through filling them all out, your little program will stuff the spreadsheet appropriately.
Only now do you find that the spreadsheet macro language will not let you do two or three of the things that you desperately need to do. Furthermore, documentation of what the language can and can't do is scant so you spend days looking for clever ways to accomplish your objective. Finally, you give up and start from scratch with a standard programming language.
Understand this phenomenon requires a little bit of history. Grizzled old hackers tell of going into insurance companies in the 1960s. The typical computer cost at least $500,000 and held data of great value. When Cromwell & Jeeves Insurance needed custom software, they didn't say, "Maybe we can save a few centimes by hiring a team of guys in India." They hired the best programmers they could find from MIT and didn't balk at paying $10,000 for a week of hard work. Back in those days, $10,000 was enough to hire a manager for a whole year, a fact not lost on managers who found it increasingly irksome.
Managers control companies, and hence policies that irk managers tend to be curtailed. Nowadays, companies have large programming staffs earning, in real dollars, one-third of what good programmers earned in the 1960s. When even that seems excessive, work is contracted out to code factories in India. Balance has been restored. Managers are once again earning three to ten times what their technical staff earn. The only problem with this arrangement is that most of today's working programmers don't know how to program.
Companies turn over projects to their horde of cubicle-dwelling C-programming drones and then are surprised when, two years later, they find only a tangled useless mess of bugs and a bill for $3 million. This does not lead companies to reflect on the fact that all the smart people in their college class went to medical, law, or business school. Instead, they embark on a quest for tools that will make programming simpler. Consider the case of Judy CIO who is flying off to meet with the executives at Junkware Systems. Judy will book her airplane ticket using a reliable reservation system programmed by highly-paid wizards in the 1960s. There is no middleware in an airline reservation system. There is no Microsoft software. There is no code written by C drones. Just one big IBM mainframe.
Judy changes planes in the new Denver airport. She could reflect on the fact that the airport opened a couple of years late because the horde of C programmers couldn't make the computerized baggage handling system work (it was eventually scrapped). She could reflect on the fact that the air traffic controllers up in the tower are still using software from the 1960s because the FAA can't get their new pile of C code to workbillions of dollars, 15 years, and acres of cubicles stuffed with $50,000-per-year programmers wasn't good for much besides a lot of memory allocation bugs. She could compare the high programmer salaries of the past and their still-working software to the low programmer salaries of the present and their comprehensive collection of bloated bug-ridden ready-any-year-now systems. However, these kinds of reflections aren't very productive for a forward-looking CIO. Judy uses her time at the airport to catch up on what passes for literature among MBAs: The Road Ahead and Dollar Signs : An Astrological Guide to Personal Finance.
Note: That last title is not made up. Here's the synopsis from amazon.com: "Financial astrology is used by many prominent investment houses to analyze and predict markets. Now, Yvonne Morabito, Penthouse's "Cosmic Cashflow" columnist and a guest correspondent on financial astrology on CNBC, shows how anyone can use financial astrology to achieve greater success with this quick-reference guide." Our friend Olin likes to say "I don't myself believe in astrology. However, I think that's because I'm a Libra and Libras are always skeptical."
Judy CIO rushes from her last flight segment into the Junkware Systems demo room where their $100,000-per-year marketing staff will explain why the Junkware 2003 system, programmed by Junkware's cubicle drones, will enable her cubicle drones to write software ten times faster and more reliably. Judy rushes to take notes on this exciting product but becomes flustered when her Handspring Treo crashes. . . .
If you despair of understanding the profound mysteries of the preceding SQL statement, Junkware 2003 is for you. Maybe the program will interview you and learn that you want a two-column table, the first column of which should be called "email" and store a string, the second of which should be called "name" and also store a string. Then Junkware 2003 will crank out the above SQL code.
create table mailing_list ( email varchar(100), name varchar(100) );
But Junkware 2003 can't know that your organization will need to produce a Web page from this table sorted by last name. That means that you really need to store first and last names in separate columns. Junkware 2003 can't know that your applications programs will fail if there are two entries with the same e-mail address. Somewhere during the data modeling process, you have to be thoughtful enough to tell SQL or Junkware 2003 to constrain the email column to be unique.
"As we enter the 21st century we find that rifle marksmanship has been largely lost in the military establishments of the world. The notion that technology can supplant incompetence is upon us in all sorts of endeavors, including that of shooting."Any interesting Web/RDBMS problem can be solved using just the standard software shipped with the RDBMS. If you can't solve it with those tools, you can't solve it with any tool no matter how glitzy and turnkey. The critical element is not the tools, it is the programmer.
-- Jeff Cooper in The Art of the Rifle (1997; Paladin Press)
In terms of maintainability, clarity of code, software development cycle, and overall Web server performance and reliability, the simplest tools such as Microsoft Active Server Pages are superior to virtually any of the expensive Web development systems out there.
A substantial portion of your investment in Web site programming can be wasted if you choose a middleware approach. The most expensive must-have tools of the 1990s are all history now. People can't remember the companies or the products. Yet organizations are still running sites that were developed with these tools. Their now-dead vendor saddled them with either a new computer language that the vendor created or a non-standard set of library calls from an existing computer language. Either way it amounts to the same thing: the legacy site is in a "Middle Webbish" language that no modern programmer can understand. The software for parsing Middle Webbish was never open source and therefore even if an organization had the inclination it could not hire a programmer to fix bugs in the underlying system.
Occasionally a product is sufficiently good and sufficiently stable and sufficiently likely to last that it is worth the dependency risk. But don't take that risk simply because you are afraid to think and program. Afraid or not, you will eventually have to think and program.
Steel yourself against the marketing assaults of junkware/middleware vendors by remembering that all the hard stuff is independent of specific technology and product choices. The next three chapters aren't all that technical. If you read them carefully, you won't turn into one of those people who are afraid to program and therefore go running into the arms of junkware peddlers.
If you despair of learning how to do anything productive, what you might have learned from this chapter is that you should work with the programmer and user interface designer to build the site that fits your publishing model before bringing in a graphic designer to make it pretty.
If none of the above seems inspiring, let's close with a quote from The Magic Mountain, possibly the most boring book ever written (writing a high quality computer book like this naturally gives one license to criticize winners of the Nobel Prize for Literature). Thomas Mann clearly knew just how dull these 700 pages would seem to an audience spoiled by the twentieth century because he wrote in the foreword:
"We shall tell it at length, thoroughly, in detail for when did a narrative seem too long or too short by reason of the actual time or space it took up? We do not fear being called meticulous, inclining as we do to the view that only the exhaustive can be truly interesting."