
Chapter 3: Scalable Systems for Online Communities

by Philip Greenspun, part of Philip and Alex's Guide to Web Publishing

Revised June 2003

As a society gets richer and better-equipped with machines, people spend less time grubbing out the basics of food and shelter and more time on education. Some of this time is spent in a setting that everyone can recognize as educational, e.g., a college classroom. Most education, however, occurs in non-traditional settings.

Our media does not portray the Michigan Militia (michiganmilitia.com) as a primarily educational institution. Yet a new member must learn where and when to meet, a body of Constitutional law, field communication skills, firearm safety, and marksmanship. To rise in the organization, a member must learn how to lead and educate other members.

Suppose that you decide to adopt a dog. You have to learn about the characteristics of different breeds. After choosing a breed, you have to learn about good breeders in your region. After choosing a puppy, you have to learn about training and learn about good vets in your city. You have to learn what brand of dog food is best and where to buy it. You have to learn where it is safe and legal to let your dog off the leash so that he can run and play with other dogs. Virtually all of this education will happen through informal contacts with more experienced dog owners, none of whom will set up a classroom or expect to be paid.

If you go to work as a computer programmer in a big company, the more experienced employees will have to show you where to find the water cooler, explain to you the significance of the project on which you're working, tell you how much of the work has been done so far, teach you how to use the software development tools, and demonstrate the fine points of the relational database management system on which the system you're building relies.

What is a Community?

What common features can we extract from the above examples? A community is a group of people with varying degrees of expertise in which the experts attempt to help the novices improve their skills.

This definition embraces the traditional physical university. Professors and Ph.D. students work with undergraduates to help them learn enough to graduate. This definition is not large enough to embrace a physical small town or city neighborhood, which is what most people usually mean when they use the word. Newcomers to a residential community will need to learn how to get to the supermarket but otherwise are not likely to be pursuing a productive goal in common with other residents.

Why Would We Want Online Communities?

If you've ever strolled among the beautiful buildings of Oxford or been awed by Manhattan's skyscrapers, you might ask why anyone would want an online community. If we can have the real thing, why settle for an ersatz electronic version?

One answer is that not everyone can have the real thing. Many people wish to learn who cannot afford Ivy League tuition. Many people wish to learn who cannot afford to stop working for four years. Many people wish to learn whose responsibilities or disabilities prevent them from traveling to a university campus.

Companies can have pretty much whatever they want. Certainly they have plenty of money to build lavish offices. Yet isn't there something odd about a workaday world at the turn of the millennium that Bartleby the Scrivener (1853) would find utterly familiar? Workers come in from their homes each morning to settle into individual offices where they find the paper documents necessary for their work. With better technology and management techniques, perhaps it would be possible to benefit from contributions by part-time workers or workers who don't leave their houses. Perhaps projects could be finished sooner by workers cooperating in rooms devoted to the project rather than isolated in offices mostly devoted to storing paper documents from previously completed projects.

If you still feel that physical communities must always be superior to electronically linked communities, let me ask you to ponder three words: junior high school.

Junior high school throws people together who have nothing in common besides parents who chose to locate in a particular neighborhood. Unless you're very adaptable, it is tough to find good friends. High school is more or less the same idea, but the pool of people is usually larger so it is more probable that kids with uncommon interests will find soulmates. In college, not only is the pool larger but there can be a concentration by personality type. Nerds find each other at Caltech and MIT; hippies find each other at Bard and Reed; snobs find each other at Harvard and Princeton; skiers find each other at state schools in Colorado and Vermont. When students graduate and go to work, they usually don't make as many friends. They aren't meeting as many people and the common thread of "do not want to starve in street" doesn't tie them very tightly to other workers.

What we can infer from this is that people make the best friends when the pool is large and the interests are common. Enter the Internet, which affords instant communication among millions of people worldwide. It isn't possible to find a pool on a comparable scale except in the world's largest cities. Given the Internet's raw communication capability and huge pool of potential friends, if you want to make a really great friend you just need a means of finding someone who shares your interests and then a means of collaborating with him or her.

To summarize, online communities make genuinely new things possible. In practice, however, we have to do some programming work first if the communities are to remain useful as they grow. The evolution of public communities is instructive.

Evolution of Public Communities

The Ancient World
People got information from personal communication and groups meeting face-to-face. The influence of government and commercial interests on information was limited.

The Modern World
With the invention of movable type (popularly credited to Johannes Gutenberg in 1450, though the Chinese alchemist Pi Sheng was printing with movable type as early as 1041; Gutenberg's contribution was a printing system incorporating a number of practical refinements), information became susceptible to government and commercial control. The mass media exclude information that will offend advertisers. Governments have powerful systems with which to distribute propaganda.

The Early Internet
Most of the information that users got from the ARPAnet and early Internet was personal communication. Users got personal e-mail letters, mailing list letters, discussion group postings, documents written by individuals working without a publisher, and computer programs that expressed individual ideas. There was no advertising. There was little or no participation by major commercial interests. People reading a USENET discussion of Chrysler versus Toyota cars would get information from owners and none from the manufacturers.

Internet Circa 2003
With the Web, the Internet finally became comprehensible to corporate PR departments. The best organized and most heavily used Web services are thoroughly corrupted by banner advertising and kickback arrangements. Some of the bones of the early Internet are still visible but they haven't scaled well. In the old days, you could read about Chrysler versus Toyota in the USENET discussion group rec.autos. Now, it isn't really clear where that discussion should go. rec.autos is no longer a group; it is the top of a hierarchy. You could read rec.autos.makers.chrysler, but you might get tired of wading through the 100+ postings a day, especially as much of it might be spam from auto dealers or spam from generic commercial advertisers. With nobody to organize the content, it is unlikely to be very useful. You're more likely to do your research at autos.yahoo.com, a beautifully organized service admittedly, but one whose content consists of information from commercial sources punctuated by banner ads.

What do we conclude from these observations? Technology profoundly affects the type of community that can be sustained and the extent to which information flows from few to many or from many to many.

The Big Problem

Thousands of people are operating public community-style Web services. Virtually all of them are using simple standalone software packages to handle things like discussion forums or classified ads. When one of these sites becomes popular, the publisher begins to devote 80 hours a week of free labor to moderating discussions, weeding out redundant classified ads, deleting alerts for users whose e-mail addresses have evaporated, answering questions from the confused, keeping content up-to-date, developing new content in response to user questions, etc. The beautiful thing about this is that so many people are willing to devote 80 unpaid hours a week toward helping their fellow human being. The ugly thing about this is that 80 hours a week turns out not to be enough.

Site growth can outstrip the capacity of any person, no matter how dedicated or efficient.


One typical reaction on the part of the publisher is to turn the formerly non-commercial site into a showcase for whoredom. Users return to find six banner ads on the home page and links to kickback-paying referral partners obscuring content that had formerly been highlighted in relation to its utility. With all of the money flowing in, at least the publisher's scaling problems are history. More users means more page loads means more banner ads served means more revenue. The publisher can hire a discussion forum moderator and a customer service staff. Money can be used to hire writers and a webmaster to organize their contributions. The content may be bland and tainted by commercial interests, but at least the publisher is making a fat profit.... Oops! In practice, nearly all commercial community site publishers are losing money because the cost per user to maintain the site is too high.

Corporate intranet communities also need to scale. It really would be sad to have to hire a new moderator, webadmin, or sysadmin for every new employee. Yet the intranet community should be as vital as any public community site. If an employee sends another employee private e-mail asking a how-to question, that should be regarded as a failure of the intranet community software. Why wasn't it more efficient for these folks to collaborate using a Web service that would then archive the discussion?

Based on the author's experience with hundreds of Internet applications, it seems that all successful sites share the following six required elements for a sustainable online community:

  1. magnet content authored by experts
  2. means of collaboration
  3. powerful facilities for browsing and searching both magnet content and contributed content
  4. means of delegation of moderation
  5. means of identifying members who are imposing an undue burden on the community and ways of changing their behavior and/or excluding them from the community without them realizing it
  6. means of software extension by community members themselves
If we can provide these elements in some fashion we will have an application that is reasonably effective for users. If not, all the world's fanciest technology will be of little comfort.

The Big Solution

The big solution is a configurable set of software modules that will
  1. keep a database of users, how to contact them, and how private they want their personal information kept
  2. keep a database of site content, who contributed it, and how each piece relates to the others
  3. keep track of which users have looked at which pieces of content
  4. keep track of which users are costing the community time and money
  5. keep track of how users are coming into the site and which external links they are selecting (clickthroughs)
  6. if a commercial site, keep track of which advertisers' banner ads have been served and to whom and whether or not they were effective
  7. help the site maintainers keep in contact with different classes of users
What are some of the ultimate goals of having all of these software modules installed and set up? Consider how they could be applied day-to-day at the photo.net online community. A seven-module software system that offers the preceding capabilities is going to be expensive to design, expensive to program, and expensive to maintain. Our new innovative community software system will require an underlying relational database management system that, though not innovative, also tends to be expensive to purchase and maintain (more on that in the "Choosing a Relational Database" chapter).
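The first few of these modules reduce to a handful of related tables in the RDBMS. Here is a rough sketch of modules 1 through 3 (the table and column names are my own invention, not those of any particular toolkit, and SQLite stands in for a production RDBMS such as Oracle):

```python
import sqlite3

# In-memory database for illustration; a real community site would use
# a server RDBMS.
db = sqlite3.connect(":memory:")

db.executescript("""
-- Module 1: users, how to contact them, privacy preferences
CREATE TABLE users (
    user_id         INTEGER PRIMARY KEY,
    email           TEXT NOT NULL,
    email_is_public INTEGER DEFAULT 0       -- 0 = keep private
);

-- Module 2: content, who contributed it, how pieces relate
CREATE TABLE content (
    content_id INTEGER PRIMARY KEY,
    author_id  INTEGER REFERENCES users(user_id),
    parent_id  INTEGER REFERENCES content(content_id),  -- threading
    title      TEXT,
    body       TEXT
);

-- Module 3: which users have looked at which pieces of content
CREATE TABLE content_views (
    user_id    INTEGER REFERENCES users(user_id),
    content_id INTEGER REFERENCES content(content_id),
    view_date  TEXT
);
""")

db.execute("INSERT INTO users VALUES (1, 'philg@mit.edu', 1)")
db.execute("INSERT INTO content VALUES (1, 1, NULL, 'Choosing a camera', '...')")
db.execute("INSERT INTO content_views VALUES (1, 1, '2003-06-01')")

# Because everything lives in one database, cross-module questions
# are single queries, e.g., "who has read which articles?"
rows = db.execute("""
    SELECT u.email, c.title
      FROM content_views v
      JOIN users u ON u.user_id = v.user_id
      JOIN content c ON c.content_id = v.content_id
""").fetchall()
print(rows)
```

The remaining modules (clickthrough tracking, ad serving, member contact) are more tables hanging off the same `users` table, which is precisely why an integrated data model pays off.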

Before any Web publisher contemplates running an online community, it is probably worth stepping back to ask which components of the software should be built, which developed cooperatively with other publishers, and which can and should be purchased off-the-shelf.

Buy or Build?

Someone who wants a community site with at least the basic capabilities above has to do the following:
  1. choose Windows or Unix
  2. choose a relational database management system (RDBMS)
  3. choose a Web/DB integration tool
  4. write SQL data models
  5. design a user interface to the legal transactions
  6. write dynamic pages that pull information out and stuff data into SQL tables (i.e., implement the user interface to the transactions)
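Step 6 is where most of the programmer-hours go. Conceptually, a dynamic page is just a function from rows in a SQL table to HTML. A toy sketch in Python (the table name and URL scheme are invented for illustration; the mid-1990s equivalent would have been a Perl/CGI script):

```python
import sqlite3
from html import escape

def forum_index_page(db):
    """Render a discussion-forum index as HTML from rows in a SQL table."""
    rows = db.execute(
        "SELECT msg_id, subject FROM forum_messages ORDER BY msg_id DESC"
    ).fetchall()
    items = "\n".join(
        # escape() keeps user-contributed subjects from injecting markup
        f'<li><a href="/msg?id={msg_id}">{escape(subject)}</a></li>'
        for msg_id, subject in rows
    )
    return f"<html><body><h2>Forum</h2><ul>\n{items}\n</ul></body></html>"

# Demonstration with an in-memory database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE forum_messages (msg_id INTEGER PRIMARY KEY, subject TEXT)")
db.execute("INSERT INTO forum_messages VALUES (1, 'Oracle vs. DB2?')")
page = forum_index_page(db)
print(page)
```

The point of the sketch is the shape of the work, not the tools: every legal transaction on the site needs a page like this for reading and another for writing, and those pages are what the $500,000 below buys.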

Suppose that the new publisher makes all of these decisions correctly. It will still take six months and $500,000 in programmer time to reproduce the capabilities of online communities that were up and running in the mid-1990s. In fact, in terms of management attention, hourly wages, and lost time-to-market, it generally costs at least $100,000 just to make the operating system/RDBMS/Web tool decision: managers who don't know SQL will sit in meetings with salesmen who don't know SQL, trying to figure out whether Oracle, Microsoft SQL Server, or DB2 is best.

Bottom line: a fortune will be spent on programming; schedules will slip; the users will get reamed by all the bugs; and if the site survives, it will be so expensive to maintain that all the maintenance jobs will have to be exported to the Third World. Just as with any other custom software development effort.

Will this always be true? No. The basic argument of this chapter is that the development of the Web server-side software industry has recapitulated, and will continue to recapitulate, the development of the business data processing software industry.

History of business data processing

In the 1960s, people who needed to do business data processing bought "iron" (mainframe computer hardware). They hired programmers to write what today we would call a database management system. The same programmers would then build data models and application programs to put data into and pull data out of those data models.

In the 1970s, people would buy a commercial database management system plus some kind of iron on which to run it. They were tired of suffering with bugs in their programmers' ad hoc database management schemes and figured that their data storage needs weren't any different from hundreds of other computer users. Company programmers were used now only to write data models and application programs.

In the 1990s, people buy an enterprise software system such as SAP or Oracle Financials. They then buy an RDBMS to support it. They finally buy some iron to support the RDBMS. Company programmers are used only for some customization of the canned data models and apps.

Note that over these 40 years, there has been a huge transfer of power from iron vendors to data model/app vendors. Savvy companies realized this and adapted. IBM, for example, went heavily into the DBMS software business and then into business apps. On some days, you can go to the Oracle Web site and never learn that they make an RDBMS; the whole front page is given over to promoting the packaged business applications that they also sell. Not-so-savvy makers of iron or DBMS software have been nearly destroyed, e.g., Digital and Sybase.

Why the big shift from custom programming to packaged apps?

Given that business managers pride themselves on being innovative, why the huge rush to have everyone using the same handful of enterprise software systems? Wouldn't they be better off with custom-written programs that have a lot of embedded knowledge about their particular products and the way that they like to deal with customers or vendors? Possible Answer 1 is that popular programming tools aren't really any better than they were in the 1960s but system requirements are more complex, thus making custom software development more expensive and perilous. Possible Answer 2 is that managers have figured out how to pay programmers so little that only stupid or lazy people are willing to work these days as programmers, thus making custom software development more expensive and perilous. Possible Answer 3 is that business managers are in fact no more innovative than a herd of sheep. They all read the same business literature. They all think about purchasing, invoicing, and payroll in exactly the same way. So they'd no more write custom information systems for their company than they would write a custom word processor.

Maybe it is a combination of all three. In any case, if your company handles payroll exactly the same way that Wombley's Widgets, Inc. does and they have a working program to do it, then you might as well use Wombley's software. If you hire programmers to build it from scratch, the best case is that you'll spend some money and get a working system. In the expected case, you'll spend a few years and millions of dollars working through all the bugs that Wombley's Widgets worked through five years before. The worst (and surprisingly common) case is that you'll spend ten years and tens of millions of dollars before having to scrap the whole project.

How about the Web?

In the early days of the Web, publishers started with iron, usually a desktop machine running the Unix operating system. Then they'd hire a programmer who'd write a Perl/CGI script that pulled data out of and stuffed data into a Unix file in a custom format, i.e., their own database management system. By 1995, publishers had noticed that such programs tended to have a lot of strange mutual exclusion bugs; e.g., if two users simultaneously entered orders, the little custom database table would get corrupted. So publishers started building on top of a standard RDBMS such as Oracle, moving to where business data processing folks were in the 1970s. That's more or less where we are now in June 2003 as I write this chapter.

A Packaged Solution?

Could one develop and distribute a packaged solution to Web publishing as one does for word processing or corporate purchasing? It depends on one's level of intelligence and cunning. Suppose that we have a CD-ROM containing an "enterprise software system." Here are a couple of possible descriptions for the same software:

Described by the engineers who built the software: "Here's a collection of hacks that we've assembled after building data processing systems for 15 companies. We're sorry that we never really finished it and that it doesn't do everything you need and that it will take 50 programmer-years to fill in the cracks and make it work for your business. But when you're done you'll probably have fewer bugs than if you'd started from scratch."

Described by the marketing department: "This is a comprehensive turnkey business data processing system already in use by 15 large and sophisticated companies. It does absolutely everything you need and is so flexibly designed that it will only take you 50 programmer-years to customize for your unique business practices."


Whose description is accurate? They both are. Which description do you think results in $35 million in software license revenue plus another $100 million for consulting?

One Packaged Solution or Ten?

It became fashionable in the late 1990s for people who bought information systems to purchase several "best-of-breed" applications from different vendors, drag these back to their server rooms, and proceed to put together an information system for their organization. Once the license fees have been paid, this assembly process is known as "systems integration". Suppose, for example, that the banner ad server relies on a table of users, a table of banner ads, a table of ad categories, and a table of which users have looked at which banner ads. Suppose that the best-of-breed user profiling system contains its own table of users, a table of article categories, and a table of which users have looked at which articles. Asking the question "Of the users who looked at articles about biking in France, how many clicked through on a bicycle ad?" requires bringing in a big team of programmers to unify all the tables.

At Oracle Open World in 1999, Larry Ellison asked the audience to imagine if they bought a car the way that smart business people buy computer systems:

"BMW has the best fuel injection so I'll get BMW fuel injection. I really like those big Jeep wheels so I'll get Jeep wheels. I like the Mercedes engine and I'll put it all into a Porsche body. I'll have the best car in the world because each component is best-of-breed.

...

"People buy cars from one company at a time. This is why cars are cheap and reliable. Computers were supposed to make people more productive but because of the way people buy software, our industry has created a worldwide labor shortage."

The alternative is to build an integrated system, running out of a single relational database. You might not get every last feature of every last best-of-breed application, but it won't cost $10 million of system integration time to ask simple questions across modules.
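When the ad tables and the article tables live in one database and share one users table, the point becomes concrete: the cross-module question from the systems-integration example above is a single SQL join rather than a programming project. A sketch with an invented schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (user_id INTEGER PRIMARY KEY);
CREATE TABLE article_views (user_id INTEGER, topic TEXT);
CREATE TABLE ad_clickthroughs (user_id INTEGER, ad_category TEXT);
""")
db.executemany("INSERT INTO users VALUES (?)", [(1,), (2,), (3,)])
db.executemany("INSERT INTO article_views VALUES (?, ?)",
               [(1, 'biking-in-france'), (2, 'biking-in-france'), (3, 'scuba')])
db.executemany("INSERT INTO ad_clickthroughs VALUES (?, ?)",
               [(1, 'bicycles'), (3, 'bicycles')])

# "Of the users who looked at articles about biking in France,
#  how many clicked through on a bicycle ad?"
(n,) = db.execute("""
    SELECT COUNT(DISTINCT a.user_id)
      FROM article_views a
      JOIN ad_clickthroughs c ON c.user_id = a.user_id
     WHERE a.topic = 'biking-in-france'
       AND c.ad_category = 'bicycles'
""").fetchone()
print(n)  # 1 (only user 1 both read the articles and clicked the ad)
```

Because both tables key on the same `user_id`, the query is one join. With two vendors' products and two incompatible `users` tables, the same question requires an export-transform-reload project first.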

Try to solve 100% of the problem or 40%?

Suppose that you're going to build a toolkit that lots of Web publishers can use, a packaged integrated system running out of one set of RDBMS tables. The question now is whether to try to solve 100% of the world's Web-supported collaboration problems or 40%. The human imagination and the pace of business change will doom any attempt to solve 100% of the problem. No matter how good a programmer one is, it isn't possible to build implementations of ideas that haven't been conceived yet. On the other hand, there isn't all that much new under the sun. So if we strive to embrace 100% of the problem perhaps our actual code will accomplish 90% of what the world needs. That's got to be better than 40%, no?

No!

Suppose that the total space of world-wide information system needs includes 100,000 features. A 90% solution will do 90,000 of those right out of the box. Does that mean that 90% of the world's sites can be built without writing any custom code? Only if each site required just one feature. If a site requires 10 features, there is only a 35% chance that the site can be built without new programming. If a site requires 20 features, the chance drops to 12%. If the site requires 100 features, which is getting to be a typical commercial situation, there is only about a 3 in 100,000 chance that the publisher will get away without writing any code.
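The arithmetic behind these percentages is simply that each of a site's required features is independently covered with probability 0.9, so the chance that all of them are covered is 0.9 raised to the number of features:

```python
def p_all_features_covered(n_features, coverage=0.9):
    """Probability that every one of n independently needed features
    is among the fraction `coverage` that the toolkit implements."""
    return coverage ** n_features

p10 = p_all_features_covered(10)    # ~0.35
p20 = p_all_features_covered(20)    # ~0.12
p100 = p_all_features_covered(100)  # ~2.7e-5
print(round(p10, 2), round(p20, 2), p100)
```

The independence assumption is generous to the toolkit vendor; in practice a site's required features cluster, so a toolkit that misses one feature in a cluster tends to miss its neighbors too.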

Let's make this concrete. We're building foobar.com, which requires 100 features. We have a choice of Toolkit40 or Toolkit90. With Toolkit90, we only expect to have to program 10 new features as extensions. With Toolkit40, we have to program 60 new features. So it should be six times as much work to use Toolkit40, no?

No!

In software development, the vast majority of the code base is developed to accomplish the last few percent of the features. Thus, Toolkit40 will only have one-twentieth as many lines of code as Toolkit90. The programmers working from Toolkit40 will have to write six times as many features, but they will only have to read one-twentieth as much code. Furthermore, programming within a complex system may require more lines of code. So it is possible that the 10 new features built on top of Toolkit90 will contain more code than the 60 new features built on top of Toolkit40.

Another issue is that good programmers would rather write programs than read programs. So a project built on top of the lean, clean Toolkit40 will attract better people than one built on top of the bloated, confusing Toolkit90. When you add in the fact that good programmers are 20 times more productive than average programmers, the probability of success with Toolkit40 is much larger than the probability of success with Toolkit90.

What evidence is there of the truth of the foregoing? The ERP market contains products that aim to solve 100% of corporations' accounting problems. ERP software, of which SAP is the best-known example, comes very close to including 100% of the required features. Each company that adopts an ERP system only needs a handful of new features. Yet because of the complexity of the ERP toolkit as shipped, those features will take several years, 100 programmers, and $50 million to implement.

The following figures illustrate this point:


Figure 3-1: The outer rectangle contains the union of all the features that any organization might want from an information system. Individual features are represented by dots. An attempt to solve 100% of the problem and accomplish all of the desired features will, due to the nature of engineers, invariably yield at best a solution to 90% of the problems. The oval inside the rectangle shows the portion of the possible features accomplished by the software product.


Figure 3-2: Imagine a particular organization building an information system based on the 90%-solution toolkit. Most of what they want is indeed handled by the packaged software and only a little bit of custom programming need be done (hatched area). Unfortunately, the toolkit is so unwieldy due to its attempt to solve 100% of the problem that this little bit of coding takes years.


Figure 3-3: If you map the same organization's information system onto a less ambitious toolkit you can see that the amount of extension programming goes up considerably. What you don't see is that the total implementation effort may be much lower because the underlying toolkit is much simpler. With the simpler toolkit, programmers spend much less time reading documentation, fitting their new software into the old, etc. Sometimes less is more.

Best Practices in 2003

There are only three software vendors of which one may be reasonably assured of long-term survival: Microsoft, IBM, and Oracle. Consequently it really doesn't make sense to consider purchasing enterprise software except from one of those three companies. The closest thing that Microsoft produces to the hypothetical toolkit described above is SharePoint.

If you want to build it yourself but with a little assistance on development and structure, consider working through the steps outlined in Internet Application Workbook (http://philip.greenspun.com/internet-application-workbook/), a textbook used at MIT by teams of students building online communities from scratch. Most of the teams choose to start with the Microsoft .NET tools.

If you want to find some open-source toolkits that can speed the development process, Microsoft distributes quite a few for the .NET environment. The toolkit that evolved from the old photo.net online community is available from www.openacs.org.

If you want to save yourself several hundred thousand dollars, see if an off-the-shelf multi-user server-based Weblog ("blog") application will solve your problems. A good example of this kind of product is Manila, available from manila.userland.com at a retail price of between $300 and $900. Manila is open-source and its behavior can be modified in a safe scripting language.

Now the Hard Part

Suppose that you've addressed the six required elements of online community. Such a system represents only the beginning of the effort in building the kinds of Web systems that the world needs and wants. It is a reasonable start and collects enough data that a publisher can begin to do interesting computation. Here's an example:

Given a Web site with 1000 static .html files, a discussion forum, and all the services and information above. An expert shows up at the site and begins to participate in the discussion forum and comments on some of the static pages. I want the software to automatically recognize that this person is an expert. If the expert asks "What 1001st static document can I write that will help the community the most?" I want the software to be able to suggest some topics.
This example is as hard as the entire artificial intelligence problem and could occupy brilliant computer scientists for decades.

Brilliant computer scientists? The same ones who brought the Microsoft Blue Screen of Death (TM) to your desktop and "server not responding" to your favorite ecommerce site? Or perhaps you'd rather trust the authors of the code that delayed the opening of Hong Kong's new $20 billion airport, then crippled operations and left stranded passengers smelling dead fish and rotting fruit from the stalled cargo terminal.

On second thought maybe we should try to let the community users handle some of the programming themselves. Most of the Web technology that you can buy off-the-shelf presumes a mainframe-style "priesthood-that-develops-what-users-need" world, complete with the three-tiered architecture that shut down the Hong Kong airport. The best systems to support online communities, on the other hand, are built in such a way that genuinely hard things are left to a standard commercial relational database management system. Things that don't have to be hard are done in a safe interpreted computer language so that novice programmers running the community can modify and extend the software.

If we're smart enough to develop safe and effective languages, the power of programming need not be limited to the maintainers of a community Web site. The most useful and innovative services of all are often algorithms specified by users that run on the publisher's server, e.g., "send me mail every Monday and Thursday nights if there are any new articles by my friend Judy".
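One safe way to support such user-specified algorithms is to store each rule as data rather than as code, and have the server evaluate the stored rules on a schedule. A minimal sketch (all of the names here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    """A user-specified algorithm, stored as data rather than as code."""
    subscriber: str   # who gets the e-mail
    author: str       # "articles by my friend Judy"
    days: tuple       # weekdays to run, Monday=0 ... Sunday=6

def alerts_to_send(rules, new_articles, weekday):
    """Return (subscriber, article_title) pairs due on this weekday."""
    return [
        (rule.subscriber, title)
        for rule in rules
        if weekday in rule.days
        for author, title in new_articles
        if author == rule.author
    ]

# "Send me mail every Monday and Thursday if there are any new
#  articles by my friend Judy."
rules = [AlertRule("philg@mit.edu", "Judy", days=(0, 3))]
new_articles = [("Judy", "Best lenses for safari"), ("Bob", "Tripod review")]
due = alerts_to_send(rules, new_articles, weekday=0)
print(due)
```

Because the rule is a row of data, a novice can create one through a Web form, the server can enumerate and audit all active rules, and no user-supplied code ever runs on the machine. A richer rule language would need a safe interpreter, which is exactly the architecture the paragraph above advocates.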

The programming chapters of this book illustrate the power and reliability of this software architecture for ecommerce and Web applications that replace desktop apps.

Summary

Collaboration technology can shape the way people work, learn, and live. Until the advent of the telephone, for example, the largest manageable companies had only a few hundred employees and had to be more or less in one building. Truly effective technology to support online communities will change assumptions that we don't even realize we've been making for the last 100 years.

More



or move on to Chapter 4: Static Site Development.


philg@mit.edu

Reader's Comments

I want the Title here

If I don't know HTML, how would I add a title or apply formatting to my comments?

-- Rajesh A. S. Pethe, October 8, 2004
I think it is more important than ever to keep a tight rein on the features incorporated into an application.

I built a simple order processing system whose main purpose was to enable lots of new items to be quickly entered onto the system without first setting up stock parts. The system was developed with Delphi 4 and Access 97. It worked fine. Then along comes Delphi 2005 together with souped-up third-party controls, Microsoft SQL Server Express, .NET, etc. All this newly available technology exerts a pull and you soon start asking yourself: how can this new technology be used to improve my existing application? It is usually not difficult to find ways in which it might.

SQL Server is much more secure than Access 97 for networking. I have to admit that but for pressure of other work I was sorely tempted to have a go at a technological upgrade. Fortunately, I realised it would be a major undertaking. As an afterthought I wondered how the application features themselves might be improved on the current platform. I couldn't see much scope, as it was a fairly bare-bones system, and I would want to leave any major new features until I upgraded and could take advantage of new technology. But I was wrong.

For all the standard reasons, the idea was that orders would be entered sometime before the goods were delivered so that when they arrived they could be quickly checked in and we could check we only got what we had ordered. However, as the goods were always new to the organisation and were not fully described on the order forms, they could only be entered after they had been delivered and examined. Really, the order system wasn't being used as intended. However, it was still necessary to go through all the effort of entering orders before deliveries could be accepted. All we really needed was a system to record deliveries. At a stroke a large part of the system could be eliminated.

I'm sure the above small-scale example must apply to many larger-scale systems. I was interested to read that as recently as 2003 Zara (one of Europe's most successful fashion chains) was running its tills on the DOS operating system and moving data around terminals on floppy disc - see "Zara: IT for Fast Fashion", Harvard Business School 9-604-081. It makes you wonder: if one of the most successful companies can operate this way, what are the less successful companies doing with their networked computer systems? Have they really thought through their requirements, or are the IT professionals effectively in charge, with application upgrade = technological upgrade?

-- Andrew Johnson, November 25, 2005

All this is just as relevant today as when you wrote it. Many thanks for this stuff.

-- Neil Roberts, November 12, 2007