Sites that are really databases

Ruby Beach. Olympic National Park (Washington State)

Chapter 11: Sites that are really databases

by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing

Revised July 2003

Brooks Falls, Katmai National Park (Alaska)

Armed with knowledge from the last chapter, a publisher might be tempted to rush into writing a big pile of server-side programs. It is worth stepping back from the technology for one chapter, however. First, we want to think about how technology works to achieve a publishing objective. Second, we want to see to what extent a Web publishing problem can be thought of as a database problem. Why? If we can think of our Web application as a database application, we can attack it with the database management software that industry has carefully developed since 1960.

Web applications that replace desktop applications, such as the WimpyPoint system described in the first chapter, are natural candidates for database management software. What about public sites? Let's revisit the four categories presented in Chapter 2:

Sites that provide traditional information. This is the type of site that requires the least imagination, but also the most capital investment. Find bodies of information that consumers in the 1980s bought offline and sell them online. This includes movies/videos/television, newspapers, magazines, weather reports, and stock market information. Revenue comes from advertising, links to sites that do retail transactions and give you a kickback, and occasionally subscriptions.
Sites that provide collaboratively created information. This is information that was virtually impossible to collect before the Internet. A dead-trees example would be the Consumer Reports annual survey of automobile reliability. CR collects information from its readers via mail-in forms, collates the results, and publishes them once a year. The Internet makes this kind of activity less costly for the provider and gives the user much more immediate and deeper information. Revenue comes from the same sources as in Category 1, but production expenses are lower.
Sites that provide a service via a server-side program. An example of this would be providing a wedding planning program. The user tells you how much he or she wants to spend, when and where the wedding is, who is invited, and so on. Your program then figures a detailed budget, develops an invitation list, and maintains gift and thank-you lists. You are then in a position to sell an ad to the Four Seasons Hotel that will be delivered to couples getting married on June 25th, who live less than 100 miles away, invite fewer than 80 guests, and have budgeted more than $17,000.
Sites that define a standard that enables a consumer to seamlessly query multiple databases. For example, car dealers have computers managing their inventory, but those data are imprisoned on the dealers' computers and are unavailable to consumers in a convenient manner. Suppose you define a standard that allows the inventory computers inside car dealerships to download their current selection of cars, colors, and prices. You get the car dealers to agree to provide their information to you. Then your site becomes a place where a consumer can say "I want a new dark green Dodge Grand Caravan with air conditioning and antilock brakes that's for sale within 60 miles of zip code 02176." From your query to the dealers' multiple databases, your user can get a list of all the cars available that match their criteria, and can jump right to the relevant dealer's Web site.

A Category 1 site, providing traditional information on the Web, doesn't usually start out as a database except in a degenerate way. However when all of the collaboration and personalization systems that most publishers eventually want are added, a growing percentage of the site will look like a database.

Categories 2 through 4 are databases disguised as Web servers. Information is collected in small structured chunks from a variety of simultaneous sources. It is distributed in response to queries that cannot be anticipated when the data are stored. If your site falls into categories 2 through 4 above, you are running a database even if you can't spell "SQL".

As long as one is running a database, one might as well do it right. Building a database-backed Web site requires a hard part that is heavy on the thinking and an easy part that is heavy on the implementation. As an example, let's suppose that our publishing idea is to build a mailing list system for an otherwise static Web site. We want users to be able to add and remove themselves. We need their e-mail addresses and real names. We want only the publisher to be the only person who can view the list and also the only one who can send mail to the list.

The Hard Part

We have to take our publishing idea (the mailing list) and apply three standard steps:

Develop a data model. What information are you going to store and how will you represent it?
Develop a collection of legal transactions on that model, e.g., inserts and updates.
Come up with a set of Web forms that let the user build up to one of those transactions.

Step 1: The Data Model

You've decided to store e-mail address and real name. If you are a Defense Department-funded Artificial Intelligence researcher, you might spend a few years studying this "knowledge representation" problem. Otherwise, you'd probably budget two variable-length ASCII strings, one for e-mail address and one for name.

This is the step at which you decide whether or not you are going to allow two entries with the same e-mail address, whether you're going to assign some kind of guaranteed-unique key to each entry, and whether to also store the time and date of the entry. Received wisdom in the database world is that every table should have a synthetically generated key yet in this case it is hard to understand what point would be served in having two entries with the same e-mail address. Thus it might be acceptable to use e-mail address as the key.

Step 2: Legal Transactions

Users have to be able to add themselves to the list, so one legal transaction is the insertion of a name/e-mail pair into the database. Users should also be able to remove themselves. If we've been using e-mail address as the database key the removal transaction is "delete the entry for joe@bigcompany.com". To simplify the user interface, we can decide against having any update transactions. If a user changes his name or e-mail address, he can remove the old entry and make a new one.

Step 3: Mapping Transactions onto Web Forms

The only way that users can interact with your database is via the Web, so you have to come up with ways for them to formulate legal transactions using HTML forms. If the transaction is complicated and early choices change the possibilities for subsequent choices, you may need a series of forms. In the case of this mailing list system, you probably can get away with two forms. The entry form has e-mail and name TEXT inputs. The removal form has only an e-mail TEXT input.

That's it. You've done the hard part.

The Easy Part

Brown bear stripping the (fatty) skin off a live salmon; Brooks Falls, Katmai National Park (Alaska)

You have only one thing left to do: Write a couple of programs that parse the HTML forms and turn them into actual database transactions.

Note that it is only during the easy part that you have to think about the particular technologies you are using on your server. The user has no way of knowing whether you are running the latest release of Oracle or whether all the form submissions actually turn into e-mail to someone who manually updates a regular file. The site will still work the same as far as the user is concerned.

In practice, you will probably be using some kind of relational database (see the next chapter). So your programs will be turning HTML form submissions into SQL and results from the database back into HTML. It is important to choose a reliable database management system and reasonable Web connectivity tools for that system, but your site will not live or die by the choice. If you make the wrong choices, you will work harder at writing your programs and might have to buy a computer ten times larger than an efficiently-constructed site would need. But whether a site is worth using or not depends mostly on the data model, transactions, and form design.

Prototyping the Site, My Theory

You're building a database. You're modelling data from the real world. You're going to have to write computer programs in a formal language. You have to design a user interface for that computer program. If you had an MBA then your natural first step would be . . . hire a graphic designer. After all, this computer stuff is confusing. Databases frighten you. What you really need is something that will look good at your next meeting. Graphic designers make pages that look good. You can always hire a programmer later to actually make the forms work.

The server logs of these MBAs would be a lot fatter if graphic design were the same thing as user interface design. You are building a program. The user interface happens to be Web pages. The links among the pages are part of the user interface. Whoever puts in those links has to understand whether the server can actually answer the query and, if so, how much crunching will be required.

Here's my theory on how to develop a database-backed Web site:

Publisher/editor decides what service he wants to offer, e.g., "automatic survey of camera reliability for readers of photo.net".
Programmer builds prototype with text-only user interface.
Publisher/editor critiques prototype.
Programmer refines user interface.
Graphic designer comes in to add graphics/layout, but without changing user interface.

Prototyping the Site, The Reality

Nuclear power plant cooling tower. Oswego, New York.

Back in 1995, I was asked to "make a classified ad system using a relational database that looks and works like this set of static HTML files we've put together." They had a category for each ad but decided that it "looked cleaner" to make full text search the only option on the cover page. I tried to explain about the decades of information retrieval literature that demonstrated just how bad users were at formulating full text queries. Full text search was mainly useful when you weren't able to categorize. Since we had a big list of categories, why not make those primary and offer the full text search as an alternative?

"We like the look."

Then I explained that full text search was slow. The database could find all the ads in a category in about 1/20th of a second, but full text search chewed up close to a full second of time on a $120,000 computer. And, by the way, we'd have to do the query twice. Once to fill the space between <NOFRAMES> tags for the Netscape 1.1 crowd. And then again for the subwindows for users with frames-capable browsers. The same expensive query, twice!

"We like frames."

Do I Need a College Education To Understand Your System?

The next few chapters are going to get a little bit formal and technical. If you have been ingesting your daily quota of Web technology hype, then you know that anyone can build a dynamic site without ever having to learn a programming language. Just fill out some forms, make a few strokes with your mouse, and presto: db-backed Web site. The $50,000 may seem steep, but not when you compare it to the cost of a computer science education at Stanford.

If you can buy a $500 or $5,000 or $50,000 product, why bang your head against the wall trying to get through the next two chapters? Because there is no magic bullet. All the glossy brochures and PR budgets in the world can't disguise the fact that nobody has solved the automatic programming problem. There is no program that can take an English description of what you want done and turn that into software.

The fact that technology doesn't work is no bar to success in the marketplace, though, and tools that purport to automate programming have been selling well for decades. The worst of these tools simply don't work. You'd have been better off using EDSAC machine code. (The Electronic Delay Storage Automatic Computer was a vacuum-tube machine built in 1949, the first computer in which the program instructions were stored in memory along with the data.) The best of these tools are sort of like Lotus 1-2-3.

If your problem is very simple, it is much easier to code it up in a spreadsheet than it is to write a program from scratch. The spreadsheet generates the user interface. You just enter some rules for how cells relate to each other; the spreadsheet figures out in which order to perform the computations.

After a while, though, you become dissatisfied with your program. It looks and feels, well, like a spreadsheet. Rather than ask the user to just look at the row headings and type the data into the correct cells, you'd like a little series of interview dialog boxes. When the user gets through filling them all out, your little program will stuff the spreadsheet appropriately.

Only now do you find that the spreadsheet macro language will not let you do two or three of the things that you desperately need to do. Furthermore, documentation of what the language can and can't do is scant so you spend days looking for clever ways to accomplish your objective. Finally, you give up and start from scratch with a standard programming language.

Why Don't Customers Wise Up?

After three decades of shelling out for magic programming bullets that fail, you'd think that corporate managers would wise up. Yet these products proliferate. Hope seems to spring eternal in the breasts of MBAs.

Understand this phenomenon requires a little bit of history. Grizzled old hackers tell of going into insurance companies in the 1960s. The typical computer cost at least $500,000 and held data of great value. When Cromwell & Jeeves Insurance needed custom software, they didn't say, "Maybe we can save a few centimes by hiring a team of guys in India." They hired the best programmers they could find from MIT and didn't balk at paying $10,000 for a week of hard work. Back in those days, $10,000 was enough to hire a manager for a whole year, a fact not lost on managers who found it increasingly irksome.

Managers control companies, and hence policies that irk managers tend to be curtailed. Nowadays, companies have large programming staffs earning, in real dollars, one-third of what good programmers earned in the 1960s. When even that seems excessive, work is contracted out to code factories in India. Balance has been restored. Managers are once again earning three to ten times what their technical staff earn. The only problem with this arrangement is that most of today's working programmers don't know how to program.

Companies turn over projects to their horde of cubicle-dwelling C-programming drones and then are surprised when, two years later, they find only a tangled useless mess of bugs and a bill for $3 million. This does not lead companies to reflect on the fact that all the smart people in their college class went to medical, law, or business school. Instead, they embark on a quest for tools that will make programming simpler. Consider the case of Judy CIO who is flying off to meet with the executives at Junkware Systems. Judy will book her airplane ticket using a reliable reservation system programmed by highly-paid wizards in the 1960s. There is no middleware in an airline reservation system. There is no Microsoft software. There is no code written by C drones. Just one big IBM mainframe.

Judy changes planes in the new Denver airport. She could reflect on the fact that the airport opened a couple of years late because the horde of C programmers couldn't make the computerized baggage handling system work (it was eventually scrapped). She could reflect on the fact that the air traffic controllers up in the tower are still using software from the 1960s because the FAA can't get their new pile of C code to work—billions of dollars, 15 years, and acres of cubicles stuffed with $50,000-per-year programmers wasn't good for much besides a lot of memory allocation bugs. She could compare the high programmer salaries of the past and their still-working software to the low programmer salaries of the present and their comprehensive collection of bloated bug-ridden ready-any-year-now systems. However, these kinds of reflections aren't very productive for a forward-looking CIO. Judy uses her time at the airport to catch up on what passes for literature among MBAs: The Road Ahead and Dollar Signs : An Astrological Guide to Personal Finance.

Note: That last title is not made up. Here's the synopsis from amazon.com: "Financial astrology is used by many prominent investment houses to analyze and predict markets. Now, Yvonne Morabito, Penthouse's "Cosmic Cashflow" columnist and a guest correspondent on financial astrology on CNBC, shows how anyone can use financial astrology to achieve greater success with this quick-reference guide." Our friend Olin likes to say "I don't myself believe in astrology. However, I think that's because I'm a Libra and Libras are always skeptical."

Judy CIO rushes from her last flight segment into the Junkware Systems demo room where their $100,000-per-year marketing staff will explain why the Junkware 2003 system, programmed by Junkware's cubicle drones, will enable her cubicle drones to write software ten times faster and more reliably. Judy rushes to take notes on this exciting product but becomes flustered when her Handspring Treo crashes. . . .

Is There a Better Way?

There is a better way: Learn to program (or hire someone who already knows how). The Junkware 2003 systems of the world might save you a few keystrokes here and there, but they don't attack the fundamental difficulty of correctly executing the three hard steps of designing a db-backed Web site. Here they are again:

Develop a data model. What information are you going to store and how will you represent it?
Develop a collection of legal transactions on that model, e.g., inserts and updates.
Come up with a set of Web forms that let the user build up to one of those transactions.

Throughout these steps, the difficulty is never the putatively arcane syntax of SQL or HTML, both of which can be learned in a few days. The difficulty is thinking sufficiently formally about all the system requirements that you can boil them down to table and column definitions. The difficulty is in understanding the implications of adding and removing information from tables. The difficulty is in coming up with a user interface sufficiently perspicuous that a busy and unmotivated user won't get stuck halfway through a transaction. If you are capable of that kind of careful thinking, you ought to be able to type

create table mailing_list (
        email           varchar(100),
        name            varchar(100)
);

If you despair of understanding the profound mysteries of the preceding SQL statement, Junkware 2003 is for you. Maybe the program will interview you and learn that you want a two-column table, the first column of which should be called "email" and store a string, the second of which should be called "name" and also store a string. Then Junkware 2003 will crank out the above SQL code.

Fantastic!

But Junkware 2003 can't know that your organization will need to produce a Web page from this table sorted by last name. That means that you really need to store first and last names in separate columns. Junkware 2003 can't know that your applications programs will fail if there are two entries with the same e-mail address. Somewhere during the data modeling process, you have to be thoughtful enough to tell SQL or Junkware 2003 to constrain the email column to be unique.

Just Say No To Middleware

"As we enter the 21st century we find that rifle marksmanship has been largely lost in the military establishments of the world. The notion that technology can supplant incompetence is upon us in all sorts of endeavors, including that of shooting."
-- Jeff Cooper in The Art of the Rifle (1997; Paladin Press)

Any interesting Web/RDBMS problem can be solved using just the standard software shipped with the RDBMS. If you can't solve it with those tools, you can't solve it with any tool no matter how glitzy and turnkey. The critical element is not the tools, it is the programmer.

In terms of maintainability, clarity of code, software development cycle, and overall Web server performance and reliability, the simplest tools such as Microsoft Active Server Pages are superior to virtually any of the expensive Web development systems out there.

A substantial portion of your investment in Web site programming can be wasted if you choose a middleware approach. The most expensive must-have tools of the 1990s are all history now. People can't remember the companies or the products. Yet organizations are still running sites that were developed with these tools. Their now-dead vendor saddled them with either a new computer language that the vendor created or a non-standard set of library calls from an existing computer language. Either way it amounts to the same thing: the legacy site is in a "Middle Webbish" language that no modern programmer can understand. The software for parsing Middle Webbish was never open source and therefore even if an organization had the inclination it could not hire a programmer to fix bugs in the underlying system.

Occasionally a product is sufficiently good and sufficiently stable and sufficiently likely to last that it is worth the dependency risk. But don't take that risk simply because you are afraid to think and program. Afraid or not, you will eventually have to think and program.

Summary

This chapter was intended to inspire you to read the next two. Should Chapter 2 have failed to inspire a contempt of lucre, it is worth reiterating that the most profitable sites on the Web are really databases.

Steel yourself against the marketing assaults of junkware/middleware vendors by remembering that all the hard stuff is independent of specific technology and product choices. The next three chapters aren't all that technical. If you read them carefully, you won't turn into one of those people who are afraid to program and therefore go running into the arms of junkware peddlers.

If you despair of learning how to do anything productive, what you might have learned from this chapter is that you should work with the programmer and user interface designer to build the site that fits your publishing model before bringing in a graphic designer to make it pretty.

If none of the above seems inspiring, let's close with a quote from The Magic Mountain, possibly the most boring book ever written (writing a high quality computer book like this naturally gives one license to criticize winners of the Nobel Prize for Literature). Thomas Mann clearly knew just how dull these 700 pages would seem to an audience spoiled by the twentieth century because he wrote in the foreword:

"We shall tell it at length, thoroughly, in detail – for when did a narrative seem too long or too short by reason of the actual time or space it took up? We do not fear being called meticulous, inclining as we do to the view that only the exhaustive can be truly interesting."

The Mythical Man-Month (Brooks 1995; Addison-Wesley) is the all-time classic work on software engineering and why it is so expensive and prone to delays. Brooks's primary mine of information is the System 360 project at IBM in the 1960s, which gave us the modern mainframe.
Code Complete (McConnell 1993; Microsoft Press) and Rapid Development : Taming Wild Software Schedules (McConnell 1996; Microsoft Press) document the 1990s equivalents to the IBM 360 project.

or move on to Chapter 12: Database Management Systems

philg@mit.edu

Add a comment | Add a link