Ruby Beach. Olympic National Park (Washington State)

Chapter 10: When is a Site Really a Database?

by Philip Greenspun, part of Database-backed Web Sites

Note: this chapter has been superseded by its equivalent in the new edition    


In Chapter 2 I presented four ways of standing tall on the Web:

1. Sites that provide traditional information. This is the type of site that requires the least imagination but also the most capital investment. Find bodies of information that consumers in the 1980s bought offline and sell them online. This includes movies/videos/television, newspapers, magazines, weather reports, and stock market information. Revenue comes from advertising, links to sites that do retail transactions and give you a kickback, and occasionally subscriptions.

2. Sites that provide collaboratively created information. This is information that was virtually impossible to collect before the Internet. A dead-trees example would be the Consumer Reports annual survey of automobile reliability. They collect information from their readers via mail-in forms, collate the results, and publish them once a year. The Internet makes this kind of activity less costly for the provider and gives the user more immediate and more detailed information. Revenue comes from the same sources as in Category 1, but production expenses are lower.

3. Sites that provide a service via a server-side program. An example of this would be providing a wedding planning program. The user tells you how much he or she wants to spend, when and where the wedding is, who is invited, and so on. Your program then works out a detailed budget, develops an invitation list, and maintains gift and thank-you lists. You are then in a position to sell an ad to the Four Seasons hotel that will be delivered to couples who are getting married on June 25, live less than 100 miles away, have fewer than 80 guests, and have budgeted more than $17,000.

4. Sites that define a standard that enables a consumer to seamlessly query multiple databases. For example, car dealers have computers managing their inventory, but that data is imprisoned on the dealers' computers and is unavailable to consumers in a convenient manner. Suppose you define a standard that allows the inventory computers inside car dealerships to make their current selection of cars, colors, and prices available for download. You get the car dealers to agree to provide their information to you. Then your site becomes a place where a consumer can say "I want a new dark green Dodge Grand Caravan with air conditioning and antilock brakes that's for sale within 60 miles of zip code 02176." From a single query against the dealers' multiple databases, the user gets a list of all the available cars that match those criteria and can jump right to the relevant dealer's Web site.

If you are trying to get rich and famous under Category 1, providing traditional information on the Web, then your site isn't really a database except in a degenerate way. On the other hand, unless you have Time-Warner's budget to keep it up to date, you probably won't be standing all that tall.

Categories 2 through 4 are really databases disguised as Web servers. Information is collected in small structured chunks from a variety of simultaneous sources. It is distributed in response to queries that cannot be anticipated when the data are stored. You might not be an Oracle Achiever. You might not know how to spell "SQL". But if your site falls into Categories 2 through 4 above, then you are running a database.

The Hard Part

Brooks Falls, Katmai National Park (Alaska)

1. Develop a data model. What information are you going to store and how will you represent it?

2. Develop a collection of legal transactions on that model, e.g., inserts and updates.

3. Come up with a set of Web Forms that let the user build up to one of those transactions.

Suppose that you want to build a mailing list system for your site. You want users to be able to add and remove themselves. You need their e-mail addresses and you want their real names. You want to be the only one who can view the list and the only one who can send mail to the list.

Step 1: The Data Model

You've decided to store e-mail address and real name. If you are a Defense Department-funded Artificial Intelligence researcher then you might spend a few years studying this "knowledge representation" problem. Otherwise, you'd probably budget two variable-length ASCII strings, one for e-mail address and one for name.

This is the step at which you decide whether or not you are going to allow two entries with the same e-mail address, whether you're going to assign some kind of guaranteed-unique key to each entry, and whether to also store the time and date of the entry. When I built a system like this for myself, I decided that there wasn't any point in having two entries with the same e-mail address, so the e-mail address could serve as the key. I also chose not to record the time and date of the entry, partly because I knew that my database management software was keeping this information anyway.
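
If you wrote those decisions down in SQL, the result might look something like this (a minimal sketch in the style of the table definition that appears later in this chapter; the exact data types and constraint syntax will depend on your database management system):

create table mailing_list (
        email           text primary key,  -- the address itself is the unique key
        name            text
);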

Step 2: Legal Transactions

Users have to be able to add themselves to the list. So one legal transaction is the insertion of a name/e-mail pair into the database. Users should also be able to remove themselves. In the system I built, the e-mail address per se is the database key, so the removal transaction is "delete the entry for joe@bigcompany.com". I decided against having any update transactions. If a user changed his name or e-mail address, he could just remove the old entry and make a new one.
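
Expressed against the mailing_list table sketched above, those two transactions are just an insert and a delete (the address and name below are placeholders, of course):

-- a user adds himself to the list
insert into mailing_list (email, name)
values ('joe@bigcompany.com', 'Joe User');

-- a user removes himself from the list
delete from mailing_list
where email = 'joe@bigcompany.com';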

Step 3: Mapping Transactions onto Web Forms

The only way that users can interact with your database is via the Web. So you have to come up with ways for them to formulate legal transactions using HTML forms. If the transaction is complicated and early choices change the possibilities for subsequent choices, then you may need a series of forms. In the case of this mailing list system, you can probably get away with two forms. The entry form has e-mail and name TEXT inputs. The removal form has only an e-mail TEXT input.
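
Concretely, the two forms could be as simple as the following sketch (the ACTION URLs are placeholders for whatever programs you write in the next step):

<!-- entry form -->
<FORM METHOD=POST ACTION="/cgi-bin/mailing-list-add">
E-mail address: <INPUT TYPE=TEXT NAME=email SIZE=40> <BR>
Full name: <INPUT TYPE=TEXT NAME=name SIZE=40> <BR>
<INPUT TYPE=SUBMIT VALUE="Add me">
</FORM>

<!-- removal form -->
<FORM METHOD=POST ACTION="/cgi-bin/mailing-list-remove">
E-mail address: <INPUT TYPE=TEXT NAME=email SIZE=40> <BR>
<INPUT TYPE=SUBMIT VALUE="Remove me">
</FORM>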

That's it. You've done the hard part.

The Easy Part

Brown bear stripping the (fatty) skin off a live salmon; Brooks Falls, Katmai National Park (Alaska)

You only have one thing left to do: Write a couple of programs that parse the HTML forms and turn them into actual database transactions.

Note that it is only during the easy part that you have to think about the particular technologies you are using on your server. The user has no way of knowing whether you are running the latest release of Oracle or whether all the form submissions actually turn into e-mail to someone who manually updates a regular file. The site will still work the same as far as the user is concerned.

In practice, you will probably be using some kind of relational database (see the next chapter). So your programs will be turning HTML form submissions into SQL and results from the database back into HTML. It is important to choose the best brand of database management system and the best Web connectivity tools for that system, but your site will not live or die by the choice. If you make the wrong choices, you will work harder at writing your programs and might have to buy a computer 10 times larger than an efficiently constructed site would need. But whether a site is worth using or not depends mostly on the data model, transactions, and form design.
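
For the mailing list example, the "add me" program could be as short as the following rough sketch, written in the spirit of the Oraperl CGI scripts discussed later in this chapter (the interpreter path, database name, user name, and password are placeholders I've made up, and a production script would want friendlier error handling):

#!/usr/local/bin/oraperl
# add-to-mailing-list: turn an HTML form submission into an SQL insert
# (the interpreter path above depends on where oraperl is installed)

# read and decode the URL-encoded form variables from the POST body
read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'});
foreach $pair (split(/&/, $buffer)) {
    ($key, $value) = split(/=/, $pair);
    $value =~ tr/+/ /;
    $value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
    $form{$key} = $value;
}

# double any single quotes so the values are legal SQL string literals
($email = $form{'email'}) =~ s/'/''/g;
($name  = $form{'name'})  =~ s/'/''/g;

print "Content-type: text/html\n\n";

# log in to Oracle and do the insert ('mydb', 'scott', 'tiger' are placeholders)
$lda = &ora_login('mydb', 'scott', 'tiger') || die $ora_errstr;
&ora_do($lda, "insert into mailing_list (email, name)
               values ('$email', '$name')") || die $ora_errstr;
&ora_logoff($lda);

print "<html><body>Thanks, $form{'name'}. You are on the list.</body></html>\n";

The removal program would be identical except that it reads only the email field and issues the delete statement shown earlier.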

Prototyping the Site, My Theory

Vernal Falls, Yosemite National Park

You're building a database. You're modeling data from the real world. You're going to have to write computer programs in a formal language. You have to design a user interface for that computer program. If you have an MBA then your natural first step is . . . hire a graphic designer. After all, this computer stuff is confusing. Databases frighten you. What you really need is something that will look good at your next meeting. Graphic designers make pages that look good. You can always hire a programmer later to actually make the forms work.

The server logs of these MBAs would be a lot fatter if graphic design were the same thing as user interface design. You are building a program. The user interface happens to be Web pages. The links among the pages are part of the user interface. Whoever puts in those links has to understand whether the server can actually answer the query and, if so, how much crunching will be required.

Here's my theory on how a database Web site gets developed:

1. publisher/editor decides what service he wants to offer, e.g., "automatic survey of camera reliability for readers of photo.net"

2. programmer builds prototype with text-only user interface

3. publisher/editor critiques prototype

4. programmer refines user interface

5. graphic designer comes in to add graphics/layout, but without changing user interface

Prototyping the Site, The Reality

Nuclear power plant cooling tower. Oswego, New York.

I was asked to "make a classified ad system using a relational database that looks and works like this set of static HTML files we've put together." They had a category for each ad but decided that it "looked cleaner" to make full text search the only option on the cover page. I tried to explain about the decades of information retrieval literature that demonstrated just how bad users were at formulating full text queries. Full text search was mainly useful when you weren't able to categorize. Since we had a big list of categories, why not make those primary and offer the full text search as an alternative?

"We like the look."

Then I explained that full text search was slow. The database could cough up all the ads in a category in about 1/20th of a second, but full text search chewed up close to a full second of time on a $120,000 computer. And, by the way, we'd have to do the query twice. Once to fill the space between <NOFRAMES> tags for the Netscape 1.1 crowd. And then again for the subwindows for users with frames-capable browsers. The same expensive query, twice!

"We like frames."

Do I Need a College Education To Understand Your System?

Nerds. Massachusetts Institute of Technology, 1995.

The next few chapters are going to get a little bit formal and technical. If you have been ingesting your daily quota of Web technology hype, then you know that anyone can build a dynamic site without ever having to learn a programming language. Just fill out some forms, make a few strokes with your mouse, and presto, db-backed Web site. The $50,000 price tag may seem steep, but not when you compare it to the cost of a computer science education at Stanford.

If you can buy a $500 or $5,000 or $50,000 product, why bang your head against the wall trying to get through the next three chapters? Because there is no magic bullet. All the glossy brochures and PR budgets in the world can't disguise the fact that nobody has solved the automatic programming problem. There is no program that can take an English description of what you want done and turn that into software.

The fact that technology doesn't work is no bar to success in the marketplace, though, and tools that purport to automate programming have been selling well for decades. The worst of these tools simply don't work. You'd have been better off using EDSAC machine code (the Electronic Delay Storage Automatic Computer was a vacuum-tube machine completed in 1949 and one of the first computers to store program instructions in memory along with the data). The best of these tools are sort of like Lotus 1-2-3.

If your problem is very simple, it is much easier to code it up in a spreadsheet than it is to write a program from scratch. The spreadsheet generates the user interface. You just enter some rules for how cells relate to each other; the spreadsheet figures out in which order to perform the computations.

After a while, though, you become dissatisfied with your program. It looks and feels, well, like a spreadsheet. Rather than ask the user to just look at the row headings and type the data into the correct cells, you'd like a little series of interview dialog boxes. When the user got through filling them all out, your little program would stuff the spreadsheet appropriately.

Only now do you find that the spreadsheet macro language will not let you do two or three of the things that you desperately need to do. Furthermore, what the language can and can't do isn't really well documented, so you spend days looking for clever ways to accomplish your objective. Finally, you give up and start from scratch with a standard programming language.

Why Don't Customers Wise Up?

After three decades of shelling out for magic programming bullets that failed, you'd think that corporate managers would give up. Yet these products proliferate. Hope seems to spring eternal in the breasts of MBAs.

My personal theory requires a little bit of history. Grizzled old hackers tell of going into insurance companies in the 1960s. The typical computer cost at least $500,000 and held data of great value. When Cromwell & Jeeves Insurance needed custom software, they didn't say "maybe we can save a few centimes by hiring a team of guys in India". They hired the best programmers they could find from MIT and didn't balk at paying $10,000 for a week of hard work. Back in those days, $10,000 was enough to hire a manager for a whole year, a fact not lost on managers who found it increasingly irksome.

Managers control companies and hence policies that irk managers tend to be curtailed. Nowadays, companies have large programming staffs earning, in real dollars, one third of what good programmers earned in the 1960s. When even that seems excessive, work is contracted out to code factories in India. Balance has been restored. Managers are once again earning 3-10 times what their technical staff earn. The only problem with this arrangement is that most working programmers today don't know how to program.

Companies turn over projects to their horde of cubicle-dwelling C-programming drones and then are surprised when, two years later, they find only a tangled useless mess of bugs and a bill for $3 million. This does not lead companies to reflect on the fact that all the smart people in their college class went to medical, law, or business school. Instead, they embark on a quest for tools that will make programming simpler. A manager will book an airplane ticket using a reservation system programmed by highly-paid wizards in the 1960s, never thinking that it might fail. Then the flight will be delayed. The new Denver airport isn't open. The horde of C programmers is a couple of years late with the computerized baggage handling system (it was eventually scrapped). The air traffic controllers are still using the old software because the FAA's horde of $50,000 per year programmers has spent 15 years squelching each other's memory allocation bugs. When our manager gets off the delayed flight, she'll happily trundle to the Junkware Systems demo room where their $100,000 per year marketing staff will explain why the Junkware 2000 system, programmed by Junkware's cubicle drones, will enable her cubicle drones to write software 10 times faster and more reliably.

Is There A Better Way?

There is a better way: learn to program (or hire someone who already knows how). The Junkware 2000 systems of the world might save you a few keystrokes here and there, but they don't attack the fundamental difficulty of correctly executing the three hard steps of designing a DB-backed Web site. Here they are again:

1. Develop a data model. What information are you going to store and how will you represent it?

2. Develop a collection of legal transactions on that model, e.g., inserts and updates.

3. Come up with a set of Web Forms that let the user build up to one of those transactions.

The hard part of step 1, developing a data model, is not the putatively arcane syntax of SQL. The hard part is thinking sufficiently formally about all the system requirements that you can boil them down to tables and columns. Is it too hard to type

create table mailing_list (
        email           text,
        name            text
);

? If so, then Junkware 2000 is for you. Maybe it will interview you to learn that you want a two-column table, the first column of which should be called "email" and store a string, the second of which should be called "name" and also store a string. Then Junkware 2000 will crank out the above SQL code.

Fantastic!

But Junkware 2000 can't know that your organization will need to produce a Web page from this table sorted by last name. That means you really need to store first and last names in separate columns. Junkware 2000 can't know that your application programs will only work if there is only one entry per email address. You have to tell SQL or Junkware 2000 to make sure that the email column is constrained to be unique.
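
For the record, telling SQL those two things is not much harder than typing the original table definition. Here is one way it might look (a sketch; the column names and types are just reasonable guesses):

create table mailing_list (
        email           text unique,    -- at most one entry per e-mail address
        first_name      text,
        last_name       text            -- separate column so pages can be sorted by last name
);

select email, first_name, last_name
from mailing_list
order by last_name;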

Just Say No To Middleware

Any interesting Web/RDBMS problem can be solved using just the standard software shipped with the RDBMS. If you can't solve it with those tools, you can't solve it with any tool no matter how glitzy and turnkey. The critical element is not the tools, it is the programmer.

I wrote my first database-backed Web site in 1994 using Oraperl CGI scripts. Kevin Stock sat down for about two weeks and linked Perl 4 with the Oracle C libraries, adding about five new Perl functions in the process. He christened the new version "Oraperl" and distributed the source code for free.

In terms of maintainability, clarity of code, software development cycle, and overall Web server performance and reliability, Oraperl CGI scripts are better than any of the purpose-built Web/RDBMS tools that I have used, with the exception of AOLserver and Meta-HTML. Would I cry if I had to give up the brilliantly designed AOLserver and go back to Oraperl and NCSA HTTPD 1.3? Not really. Using AOLserver probably cuts only 10% of the development time out of a big db-heavy project. It gives me about a 10X improvement in server throughput over a CGI-based approach, but it can't do thinking and software development for me.

Would I cry if I had to use the latest and greatest hyped tools? Absolutely. Oraperl, AOLserver, and Meta-HTML let me change a script in Emacs (the standard Unix text editor) and the new version is immediately live on the Web. A "total solution" like Netscape LiveWire would force me to recompile my application to test any change, slowing my development effort. If I were to shell out $70,000 for the complete Informix Universal Server and their award-winning Web Blade, I wouldn't be able to use Emacs anymore to write my code. I'd have to use Netscape Navigator as a text editor. I'd be forced to use a programming language even more obscure and less powerful than Perl. My pages would deadlock with each other. I'd be thrust into a hell of system administration trying to figure out what was wrong.

Summary

This chapter is intended to inspire you to read the next three. I want to inspire your contempt for poverty by reminding you that the most profitable sites on the Web are really databases. I want to inspire your contempt for technology by reminding you that all the hard stuff is independent of the product choices that you make. I want to inspire your contempt for managers in two ways: (1) by reminding you that a programmer/user interface designer should build the site and then let the manager and graphic designer tart it up; (2) by reminding you that we don't need the FBI to figure out why most modern software doesn't work. I want to inspire your contempt for people who are afraid to program and run into the arms of junkware peddlers.

Read the next three chapters carefully. They aren't really all that technical. If you absorb them, your time will be worth $1250 per day.

If that isn't motivation enough, I'll close with a quote from The Magic Mountain, possibly the most boring book ever written (writing a high quality computer book like this naturally gives me license to criticize winners of the Nobel Prize for Literature). Thomas Mann clearly knew just how dull these 700 pages would seem to an audience spoiled by the 20th century because he wrote this in the foreword:

"We shall tell it at length, thoroughly, in detail – for when did a narrative seem too long or too short by reason of the actual time or space it took up? We do not fear being called meticulous, inclining as we do to the view that only the exhaustive can be truly interesting."

Note: If you like this book you can move on to Chapter 11.


philg@mit.edu