"It is not hard to see why the government of a region becomes less and less manageable with size. In a population of N persons, there are of the order of N2 person-to-person links needed to keep channels of communication open. Naturally, when N goes beyond a certain limit, the channels of communication needed for democracy and justice and information are simply too clogged, and too complex; bureaucracy overwhelms human process. ...Let's also remind ourselves of the empirical evidence that enormous online communities cannot satisfy every need. America Online has not subsumed all the smaller communities on the Internet. People unsubscribe from mailing lists when the traffic level becomes too high. Early adopters of USENET discussion groups (called "Netnews" or "Newsgroups" back in the 1970s and "Google Groups" to most people in 2005) stopped participating because they found the utility of the groups diminished when the community size grew beyond a certain point.
"We believe the limits are reached when the population of a region reaches some 2 to 10 million. Beyond this size, people become remote from the large-scale processes of government. Our estimate may seem extraordinary in the light of modern history: the nation-states have grown mightily and their governments hold power over tens of millions, sometimes hundreds of millions, of people. But these huge powers cannot claim to have a natural size. They cannot claim to have struck the balance between the needs of towns and communities, and the needs of the world community as a whole. Indeed, their tendency has been to override local needs and repress local culture, and at the same time aggrandize themselves to the point where they are out of reach, their power barely conceivable to the average citizen."
So the good news is that, no matter how large one's competitors, there will always be room for a new online community. The bad news is that growth results in significant engineering challenges. Some of the challenges boil down to simple performance engineering: How can one divide the load of supporting an Internet application among multiple CPUs and disk drives? These can typically be solved with money, even in the absence of any cleverness. The deeper challenges cannot be solved with money and hardware. Consider, for example, the following questions:
It isn't challenging to throw hardware at a performance problem. What is challenging is setting up that hardware so that the service is working if any of the components are operational rather than only if all of the components are operational.
We'll examine each layer individually.
Suppose that we have a popular application and need 16 CPUs to support all the database queries. And let's further suppose that we've decided that the RDBMS will run all by itself on one or more physical computers. Should we buy 16 small computers, each with one CPU, or one big computer with 16 CPUs inside? The local computer shop sells 1-CPU PCs for about $500, implying a total cost of $8000 for 16 CPUs. If we visit the Web site for Sun Microsystems (www.sun.com) we find that the price of a 16-CPU Sunfire 6800 is too high even to list, but if the past is any guide we won't get away for less than $200,000. We will pay 25 times as much to get 16 CPUs of the same power, but all inside one physical computer.
Why would anyone do this?
Let's consider the peculiarities of the RDBMS application. The RDBMS server talks to multiple clients simultaneously. If Client A updates a record in the database and, a split-second later, Client B requests that record, the RDBMS is required to deliver the updated information to Client B. If we were to spread the RDBMS server program across multiple physical computers, it is possible that Client A would be served from Computer I and Client B would be served from Computer II. A database transaction cannot be committed unless it has been written out to the hard disk drive. Thus all that these computers need do is check the disk for updates before returning any results to Client B. Disk drives are 100,000 times slower than RAM. A single computer running an RDBMS keeps an up-to-date version of the commonly used portions of the database in RAM. So our multi-computer RDBMS server that ensures database coherency across processors via reference to the hard disk will start out 100,000 times slower than a single-computer RDBMS server.
Typical commercial RDBMS products, such as Oracle Parallel Server, work via each computer keeping copies of the database in RAM and informing each other of updates via high-speed communications networks. The machine-to-machine communication can be as simple as a high-speed Ethernet link or as complex as specialized circuit boards and cables that achieve memory bus speeds.
Don't we have the same problem of inter-CPU synchronization with a multi-CPU single box server? Absolutely. CPU I is serving Client A. CPU II is serving Client B. The two CPUs need to apprise each other of database updates. They do this by writing into the multiprocessor machine's shared RAM. It turns out that the CPU-CPU bandwidth available on typical high-end servers circa 2002 is 100 Gbits/second, which is 100 times faster than the fastest available Gigabit Ethernet, FireWire, and other inexpensive machine-to-machine interconnection technologies.
Bottom line: if you need more than one CPU to run the RDBMS, it usually makes most sense to buy all the CPUs in one physical box.
The abstraction layer is sometimes referred to as "business logic". Something that is complex and fundamental to the business ought to be separated out so that it can be used in multiple places consistently and updated in one place if necessary. Below is an example from an e-commerce system that Eve Andersson wrote. This system offered substantially all of the features of amazon.com circa 1999. Eve expected that a lot of ham-fisted programmers who adopted her open-source creation would be updating the page scripts in order to give their site a unique look and feel. Eve expected that laws and accounting procedures regarding sales tax would change. So she encapsulated the looking up of sales tax by state, the figuring out if that state charges tax on shipping, and the multiplication of tax rate by price into an Oracle PL/SQL function:
The Web script or other PL/SQL procedure that calls this function need only know the proposed cost of an item, the proposed shipping cost, and the order ID to which this item might be added (these are the three arguments to
create or replace function ec_tax (v_price IN number, v_shipping IN number, v_order_id IN integer) return number IS taxes ec_sales_tax_by_state%ROWTYPE; tax_exempt_p ec_orders.tax_exempt_p%TYPE; BEGIN SELECT tax_exempt_p INTO tax_exempt_p FROM ec_orders WHERE order_id = v_order_id; IF tax_exempt_p = 't' THEN return 0; END IF; SELECT t.* into taxes FROM ec_orders o, ec_addresses a, ec_sales_tax_by_state t WHERE o.shipping_address=a.address_id AND a.usps_abbrev=t.usps_abbrev(+) AND o.order_id=v_order_id; IF nvl(taxes.shipping_p,'f') = 'f' THEN return nvl(taxes.tax_rate,0) * v_price; ELSE return nvl(taxes.tax_rate,0) * (v_price + v_shipping); END IF; END;
ec_tax). That sales taxes for each state are stored in the
ec_sales_tax_by_statetable, for example, is hidden from the rest of the application. If an organization that adopted this software decided to switch to using third-party software for calculating tax, that organization would need to change only this one function rather than wading through hundreds of Web scripts looking for tax-related code.
Should the abstraction layer run on its own physical computer? For most applications, the answer is "no". These procedures are not sufficiently CPU-intensive to make splitting them off onto a separate computer worthwhile in terms of system administration effort and increased vulnerability to hardware failure. What's more, these procedures often do not even warrant a new execution environment. Most procedures in the abstraction layer of an Internet service require intimate access to relational database tables. That access is fastest when the procedures are running inside the RDBMS itself. All modern RDBMSes provide for the execution of standard procedural languages within the database server. This trend was pioneered by Oracle with PL/SQL and then Java. With the latest Microsoft SQL Server one can supposedly run any .NET-supported computer language inside the database.
When should you consider a separate environment ("application server" process) for the abstraction layer? Suppose that a big bank, the result of several mergers, has an IBM mainframe to manage checking accounts, an Oracle RDBMS for managing credit accounts, and a SQL Server-based customer support system. If Jane Customer phones up the bank and asks to pay her credit card bill from her checking account, a computer program needs to perform a transaction on the mainframe (debit checking), a transaction on the Oracle system (credit Visa card), and a transaction on the SQL Server database (payment handled during a phone call with Agent #451). It is technically possible for, say, a Java program running inside the Oracle RDBMS to connect to these other database management systems but traditionally this kind of problem has been attacked by a stand-alone "application server", usually a custom-authored C program. The term "application server" has subsequently become used to describe the physical computers on which such a program might run and, in the late 1990s, execution environments for Java or C programs that served some function on a Web site other than page presentation or persistence.
Another example of where a separate physical application server might be desirable is where substantial computation must be performed. On most photo sharing sites, every time a photo is uploaded the server must create scaled versions in standard sizes. The performance challenge at the orbitz.com travel site is even more serious. Every user request results in the execution of a Lisp program written by MIT Artificial Intelligence Lab alumni at itasoftware.com. This Lisp program searches through a database of two billion flights and fares. The database machines that are performing transactions such as ticket bookings would collapse if they had to support these searches as well.
If separate physical CPUs are to be employed in the abstraction layer, should they all come in the same box or will it work just as well to rack and stack cheap 1-CPU machines? That rather depends on where state is kept. Remember that HTTP is a stateless protocol. Somewhere the server needs to remember things such as "Registered User 137 wants to see pages in the French language", "Unregistered user who started Session 6781205 has placed the hardcover edition of The Cichlid Fishes in his or her shopping cart." In a multi-process multi-computer server farm, it is impossible to guarantee that a particular user will always be returned to the same running computer program, if for no other reason than you want the user experience to be robust to failure of an individual physical computer. If session state is being kept anywhere other than in a cookie or the persistence layer (RDBMS), your application server programs will need to communicate with each other constantly to make sure that their ad hoc database is coherent. In that case, it might make sense to get an expensive multi-CPU machine to support the application server. However, if all the layers are stateless except for the persistence layer, the application server layer can be handled by multiple cheap one-CPU machines. At orbitz.com, for example, racks of cheap computers are loaded with identical local copies of the fare and schedule database. Each time a user clicks to see the options for traveling from New York to London, one of those application server machines is randomly selected for action.
The most common place for script execution is within the operating system process occupied by the Web server. In other words, the script language interpreter is built into the Web server. Examples of this architecture are Microsoft Internet Information Server (IIS) and Active Server Pages, AOLserver and its built-in Tcl interpreter, Apache and the mod_perl add-in. If you've chosen to use one of these popular styles of Web development, you've chosen to merge the presentation layer with the HTTP service layer, and spreading the load among multiple CPUs for one layer will automatically spread it for the other.
The multi-CPU box versus multiple-separate-box decision here should again be based on whether or not the presentation layer holds state. If no session state is held by the running presentation scripts, it is more economical to add CPUs inside separate physical computers.
The main reason that people run out of capacity on a single front-end Web server is that HTTP server programs are usually packaged with software to support computationally more expensive layers. For example, the Oracle RDBMS server, capable of supporting the persistence layer and the abstraction layer, also includes the necessary software for interpreting Java Server Pages and performing HTTP service. If you were running a popular service directly from Oracle you'd probably need more than one CPU. More common examples are Web servers such as IIS and AOLserver that are capable of handling the presentation and HTTP service layers from the same operating system process. If your scripts involve a lot of template parsing, it is easy to overload a single CPU with the demands of the Web server/script interpreter.
If no state is being stored in the HTTP Service layer it is cheapest to add CPUs in separate physical boxes. HTTP is stateless and user interaction is entirely mediated by the RDBMS. Therefore there is no reason for a CPU serving a page to User A to want to communicate with a CPU serving a page to User B.
There are two ways to protect your users' privacy from packet sniffers. The first is by using a newer version of Internet Protocol, IPv6, which provides native data security as well as authentication. In the glorious IPv6 world, we can be sure of the origin of a packet, whether it is from a legitimate user or a denial-of-service attacker. In the glorious IPv6 world, we can be sure that it will be impractical to sniff credit card numbers or other user-sensitive data from Web traffic. As of spring 2005, however, it isn't possible to sign up for a home IPv6 connection. Thus we are forced to fall back on the 1990s-style approach of adding a layer between HTTP and TCP. This was pioneered by Netscape Communications as Secure Sockets Layer (SSL) and is now being standardized as TLS 1.0 (see http://www.ietf.org/html.charters/tls-charter.html).
However it is performed, encryption is processor-intensive. On the client side, that's not a big deal. The client machine probably has a 2 GHz processor that is 98 percent idle. However on the server end performing encryption can tie up a whole CPU per user for the duration of a request.
If you've run out of processing power the only thing to do is ... add processing power. The question is what kind and where. Adding general-purpose processors to a multi-CPU computer is very expensive as mentioned earlier. Adding additional single-CPU front-end servers to a two-tier server farm might not be a bad strategy especially because, if you're already running a two-tier server farm, it requires no new thinking or system administration skills. It is possible, however, that special-purpose hardware will be more cost-effective or easier to administer. In particular it is possible to do encryption in the router for IPv6. SSL encryption for HTTP connections can be done with plug-in boards, an example of which is the Compaq AXL300, PCI card, available in 2005 for $1400 with a claimed performance of handling 330 SSL connections per second. Finally it is possible to interpose a hardware encryption machine between the Web server, which communicates via ordinary HTTP, and the client, which makes requests via HTTPS. This feature is, for example, an option on load-balancing routers from F5 Networks (www.f5.com).
You might ask what CPU speed is this 10 hits per second per CPU number based upon? The number is independent of CPU speed! In the mid-1990s, we had 200 MHz CPUs. Web scripts queried the database and merged the results with strings embedded in the script. Everything ran on one physical computer so there was no overhead from copying data around. Only the final credit card processing pages were encrypted. We struggled to handle 10 hits per second. In the late 1990s we had 400 MHz CPUs. Web scripts queried the database and merged the results with templates that had to be parsed. Data were networked from the RDBMS server to the Web server before heading to the user. We secured more pages in response to privacy concerns. We struggled to handle 10 hits per second. In 2000 we had 1 GHz CPUs. Web scripts queried the referer header to find out if the request came from a customer of one of our co-brand partners. The script then selected the appropriate template. We'd freighted down the server with Java Server Pages and Enterprise Java Beans. We struggled to handle 10 hits per second. In 2002 we had 2 GHz CPUs. The programmers had decided to follow the XML/XSLT fashion. We struggled to handle 10 hits per second....
It seems reasonable to expect that hardware engineers will continue to deliver substantial performance improvements and that fashions in software development and business complexity will continue to rob users of any enjoyment of those improvements. So stick to 10 requests per second per CPU until you've got your own application-specific benchmarks that demonstrate otherwise.
We will start by positing a two-tier server farm with a single multi-CPU machine running the RDBMS and multiple single-CPU front-end machines, each of which runs the Web server program, interprets page scripts, performs SSL encryption, and generally does any computation not being performed within the RDBMS.
**** insert drawing of our example server farm ****
How was the CNN system experienced by users? When a student at MIT requested http://www.cnn.com/TECH/, his or her desktop machine would ask the local name server for a translation of the hostname www.cnn.com into a 32-bit IP address. (Remember that all Internet communication is machine-to-machine and requires numeric IP addresses; alphanumeric hostnames such as "www.amazon.com" or "web.mit.edu" are used only for user interface.) The MIT name server would contact the InterNIC registry to learn the IP addresses of the name servers for the cnn.com domain. The MIT name server would then contact CNN's name servers and learn that "www.cnn.com" was available at the IP address 188.8.131.52. Subsequent users within the same subnetwork at MIT would, for a period of time designated by CNN, get the same answer of 184.108.40.206 without the MIT name server going back to the CNN name servers.
Where is the load balancing in this system? Suppose that a Biology major at Harvard University requested http://www.cnn.com/HEALTH/. Harvard's name server would also contact CNN's name servers to learn the translation of "www.cnn.com". This time, however, the CNN server would provide a different answer: 220.127.116.11, leading that user, and subsequent users within Harvard's network, to a different front-end server than the machine providing pages to users at MIT.
Round-robin DNS is not a very popular load balancing method today. For one thing, it is not very balanced. Suppose that the CNN name server tells America Online's name server that www.cnn.com is reachable at 18.104.22.168. AOL is perfectly free to provide that translation to all of its more than 20 million customers. Another problem with round-robin DNS is the impact on users when a front-end machine dies. If the box at 22.214.171.124 were to fail, none of AOL's customers would be able to reach www.cnn.com until the expiration time on the translation had elapsed—the site would be up and running and providing pages to hundreds of thousands of users worldwide, but not to those users who'd received an unlucky DNS translation to the dead machine. For a typical domain, this period of time might be anywhere from 6 hours to 1 week. CNN, aware of this problem, could shorten the expiration and "minimum time-to-live" on cnn.com but if these were cut down to, say, 30 seconds, the load on CNN's name servers might start approaching the intensity of the load on its Web servers. Nearly every user page request would be preceded by a request for a DNS translation. (In fact, CNN set their minimum time-to-live to 15 minutes.)
A final problem with round-robin DNS is that it does not provide abstraction. Suppose that CNN, whose primary servers were all Unix machines, wished to run some discussion forum software that was only available for Windows. The IP addresses of all of its servers are publicly exposed. The only way to direct users to a different machine for a particular part of the service would be to link them to a different hostname, which could therefore be translated into a distinct IP address. For example, CNN would link users to "http://forums.cnn.com". Users who enjoyed these forums would bookmark the URL, and other sites on the Internet would insert hyperlinks to this URL. After a year, suppose that the Windows servers were dying and the people who knew how to maintain them had moved on to other jobs. Meanwhile, the discussion forum software has become available for Unix as well. CNN would like to pull the discussion service back onto its main server farm, at a URL of http://www.cnn.com/discuss/. Why should users be aware of this reshuffling of hardware?
**** insert drawing of server farm (cloud), load balancer, public Internet (cloud) ****
The modern approach to load balancing is the load balancing router. This machine, typically built out of standard PC hardware running a free Unix operating system and a thin layer of custom software, is the only machine that is visible from the public Internet. All of the server hardware is behind the load balancer and has IP addresses that aren't routable from the rest of the Internet. If a user requests www.photo.net, for example, this is translated to 126.96.36.199, which is the IP address of photo.net's load balancer. The load balancer accepts the TCP connection on port 80 and waits for the Web client to provide a request line, e.g., "GET / HTTP/1.0". Only after that request has been received does the load balancer attempt to contact a Web server on the private network behind it.
Notice first that this sort of router provides some inherent security. The Web servers and RDBMS server cannot be directly contacted by crackers on the public Internet. The only ways in are via a successful attack on the load balancer, an attack on the Web server program (Microsoft Internet Information Server suffered from many buffer overrun vulnerabilities), or an attack on publisher-authored page scripts. The router also provides some protection against denial-of-service attacks. If a Web server is configured to spawn a maximum of 100 simultaneous threads, a malicious user can effectively shut down the site simply by opening 100 TCP connections to the server and then never sending a request line. The load balancers are smart about reaping such idle connections and in any case have very long queues.
The load balancer can execute arbitrarily complex algorithms in deciding how to route a user request. It can forward the request to a set of front-end servers in a round-robin fashion, taking a server out of the rotation if it fails to respond. The load balancer can periodically pull load and health information from the front-end servers and send each incoming request to the least busy server. The load balancer can inspect the URI requested and route to a particular server, for example, sending any request that starts with "/discuss/" to the Windows machine that is running the discussion forum software. The load balancer can keep a table of where previous requests were routed and try to route successive requests from a particular user to the same front-end machine (useful in cases where state is built up in a layer other than the RDBMS).
Whatever algorithm the load balancer is using, a hardware failure in one of the front-end machines will generally result in the failure of only a handful of user requests, i.e., those in-process on the machine that actually fails.
How are load balancers actually built? It seems that we need a computer program that waits for a Web request, takes some action, then returns a result to the user. Isn't this what Web server programs do? So why not add some code to a standard Web server program, run the combination on its own computer, and call that our load balancer? That's precisely the approach taken by the Zeus Load Balancer (http://www.zeus.com/products/zlb/) and mod_backhand (http://www.backhand.org/mod_backhand/), a load balancer module for the Apache Web server. An alternative is exemplified by F5 Networks, a company that sells out-of-the-box load balancers built on PC hardware, the NetBSD Unix operating system, and unspecified magic software.
It seems as though the load-balancing router out front and load-balancing operating system on the RDBMS server in back have allowed us to achieve goals 1 and 3. And if the hardware failure occurs in a front-end single-CPU machine, we've achieved goal 2 as well. But what if the multi-CPU RDBMS server fails? Or what if the load balancer itself fails?
Failover from a broken load balancer to a working one is essentially a network configuration challenge, beyond the scope of this textbook. Basically what is required are two identical load balancers and cooperation with the next routing link in the chain that connects your server farm to the public Internet. Those upstream routers must know how to route requests for the same IP address to one or the other load balancer depending upon which is up and running. What keeps this from becoming an endless spiral of load balancing is that the upstream routers aren't actually looking into the TCP packets to find the GET request. They're doing the much simpler job of IP routing.
Ensuring failover from a broken RDBMS server is a more difficult challenge and one where a large variety of ideas has been tried and found wanting. The first idea is to make sure that the RDBMS server never fails. The machine will have three power supplies, only two of which are required. Each disk drive will be mirrored. If a CPU board fails, the operating system will gracefully fail back to running on the remaining CPUs. There will be several network cards. There will be two paths to each disk drive. Considering the number of moving parts inside, the big complex servers are remarkably reliable, but they aren't 100 percent reliable.
Given that a single big server isn't reliable enough, we can buy a whole bunch of them and plug them all into the same disk subsystem, then run something like Oracle Parallel Server. Database clients connect to whichever physical server machine is available. If they can't get a response from a particular server, the client retries after a few seconds to another physical server. Thus an RDBMS server machine that fails causes the return of errors to any in-process user requests being handled by that machine and perhaps a few seconds of interrupted or slow service for users who've been directed to the clients of that down machine, but it causes no longer term site unavailability.
As discussed in the "Persistence Layer" section of this chapter, this approach entails a lot of wasted CPU time and bandwidth as the physical machines keep each other apprised of database updates. A compromise approach introduced by Oracle in 2000 was to configure a two-node parallel server. The first machine would process online transactions. The second machine would be allowed to lag as much as, say, ten minutes behind the first in terms of updates. If you wanted a CPU-intensive report querying last month's user activity, you'd talk to the backup machine. If Machine #1 failed, however, Machine #2 would notice almost immediately and start rolling its own state forward from the transaction log on the hard disk. Once Machine #2 was up to date with the last committed transaction, it would begin offering service as the primary database server. Oracle proudly stated that, for customers willing to spend twice as much for RDBMS server hardware, the two-node failover configuration was "only a little bit slower" than a single machine.
Be explicit about the number of computers employed, the number of CPUs within each computer, and the connections among the computers.
Your answer to this exercise should be no longer than half a page of text.
Be explicit about the number of computers employed, the number of CPUs within each computer, and the connections among the computers. If you're curious about the real numbers, remember that eBay is a public corporation and publishes annual reports, which are available at http://investor.ebay.com/.
Your answer to this exercise should be no longer than one page.
Note: http://philip.greenspun.com/ancient-history/webmail/ describes an Oracle-backed Web mail system built by Jin S. Choi.
Perhaps we can take some ideas from the traditional face-to-face world. Let's look at some of the things that make for good offline communities and how we can translate them to the online world.
How do we translate the features of identifiability, authentication, and accountability into the online world? In private communities, such as corporate knowledge management systems or university coordination services, it is easy. We don't let anyone use the system unless they are an employee or a registered student and, in the online environment, we identify users by their full names. Such heavyweight authentication is at odds with the practicalities of running a public online community. For example, would it be practical to schedule face-to-face meetings with each potential registrant of photo.net, where the new user would show an ID? On the other hand, as discussed in the "User Registration and Management" chapter, we can take a stab at authentication in a public online community by requiring email verification and by requiring alternative authentication for people with Hotmail-style email accounts. In both public and private communities, we can enhance accountability simply by making each user's name a hyperlink to the complete record of their contributions to the site.
In the face-to-face world, a speaker gets a chance to gauge audience reaction as he or she is speaking. Suppose that you're a politician speaking to a women's organization, the WAGC ("Women Against Gun Control", www.wagc.com). Your schedule is so heavy that you can't recall what your aides told you about this organization, so you plan to trot out your standard speech about how you've always worked to ensure higher taxes, more government intervention in individuals' lives, and, above all, to make it more difficult for Americans to own guns. Long before you took credit for your contribution to the assault rifle ban, you'd probably have noticed that the audience did not seem very receptive to your brand of paternalism and modified your planned speech. Typical computer-mediated communication systems make it easy to broadcast your ideas to everyone else in the service, but without an opportunity to get useful feedback on how your message is being received. You can send the long email to the big mailing list. You'll get your first inkling as to whether people liked it or not after the first 500 have it in their inbox. You can post your reply to an emotionally charged issue in a discussion forum, but you won't get any help from other community members, at least not through the same software, before you finalize that reply.
Perhaps you can craft your software so that a user can expose a response to a test audience of 1 percent of the ultimate audience, get a reaction back from those sample recipients, and refine the message before authorizing it for delivery to the whole group.
This causes a properly configured browser to launch the AIM client (try it). Although the AIM-based chat offered superior interactivity, it was not as successful due to (1) some users not having the AIM software on their computers, (2) some users being behind firewalls that prevented them from using AIM, but mostly because (3) photo.net users knew each other by real names and could not recognize their friends by their AIM screen names. It seems that providing a breakout and reassemble chat room is useful, but that it needs to be tightly integrated with the rest of the online community and that, in particular, user identity must be preserved across all services within a community.
<a href="aim:gochat?RoomName=photonet">photo.net chatroom</a>
People like computers and the Internet because they are fast. If you want an answer to a question, you turn to the search engine that responds quickest and with the most relevant results. In the offline world, people generally desire speed. A Big Mac delivered in thirty seconds is better than a Big Mac delivered in ten minutes. However, when emotions and stakes are high, we as a society often choose delay. We could elect a president in two weeks, but instead we choose presidential campaigns that last nearly two years. We could have tried and sentenced Thomas Junta immediately after July 5, 2000, when he beat Michael Costin, father of another ten-year-old hockey player, to death in a Boston-area ice rink. After all, the crime was witnessed by dozens of people and there was little doubt as to Junta's guilt. But it was not until January 2002 that Junta was brought to trial, convicted, and sentenced to six to ten years in prison. Instant messaging, chat rooms, and Web-based discussion forums don't always lend themselves to thoughtful discourse, particular when the topic is emotional.
|"As an online discussion grows longer, the probability of a comparison involving Nazis or Hitler approaches 1" — (Mike) Godwin's Law|
How difficult is it in the offline world to find people interested in the issues that are important to us? If you believe that charity begins at home and all politics is local, finding people who share your concerns is as simple as walking around your neighborhood. One way to translate that to the online world would be to build separate communities for each geographical region. If you wanted to find out about the environment in your state, you'd go to massachusetts.envrionmentaldefense.org. But what if your interests were a bit broader? If you were interested in the environment throughout New England, should you have to visit five or six separate servers in order to find the hot topics? Or suppose that your interests were narrower. Should you have to wade through a lot of threads regarding the heavily populated eastern portion of Massachusetts if you live right up against the New York State border and are worried about a particular chemical plant?
The geospatialized discussion forum, developed by Bill Pease and Jin S. Choi for the scorecard.org service, is an interesting solution to this problem. Try out the following pages:
Another way to look at geospatialization is of the users themselves. Consider, for example, an online learning community centered around the breeding of African Cichlids. Most of the articles and discussion would be of interest to all users worldwide. However it would be nice to help members who were geographically proximate find each other. Geographical clumps of members can share information about the best aquarium shops and can arrange to get together on weekends to swap young fish. To facilitate geospatialization of users, your software should solicit country of residence and postal code from each new user during registration. It is almost always possible to find a database of latitude and longitude centroids for each postal code in a country. In the United States, for example, you should look for the "Gazetteer files" on www.census.gov, in particular those for ZIP Code Tabulation Areas (ZCTAs).
Despite applying the preceding tricks, it is always possible for growth in a community to outstrip an old user's ability to cope with all the new users and their contributions. Every Internet collaboration system going back to the early 1970s has drawn complaints of the form "I used to like this [mailing list|newsgroup|MUD|Web community] when it was smaller, but now it is big and full of flaming losers; the interesting thoughtful material is buried under a heavy layer of dross." The earliest technological fix for this complaint was the bozo filter. If you didn't like what someone had to say, you added them to your bozo list and the software would hide their contributions from your view of the community.
In mid-2001 we added an "inverse bozo filter" facility to the photo.net community. If you find a work of great creativity in the photo sharing system or a thoughtful response in a discussion forum you can mark the author as "interesting". On subsequent logins you will find a "Your Friends" section in your personal workspace on the site. The people that you've marked as interesting are listed in order of their most recent contribution to the site. Six months after the feature was added 5,000 users had established 25,000 "I think that other user is interesting" relationships.
What is it about a newspaper that makes it particularly tough for that organization to act as the publisher of an online community?
/doc/planning/YYYYMMDD-scalingon your server and start writing a scaling plan for your community. This plan should list those features that you expect to modify or add as the site grows. The features should be grouped by phases.
Add a link to your new plan from
/doc/ or a planning
Let's look at some concrete scenarios. Let's assume that we have a public community in which user-contributed content goes live immediately, without having to be approved by a moderator. The problem of spam is greatly reduced in any community where content must be pre-approved before appearing to other members, but such communities require a larger staff of moderators if discussion is to flow freely.
Scenario 1: Sarah Moneylover has registered as User #7812 and posted 50 article comments and discussion forum messages with links to her "natural Viagra" sales site. Sarah clicked around by hand and pasted in a text string from a word processor open on her desktop, investing about 20 minutes in her spamming activity. The appropriate tool for dealing with Sarah is a set of efficient administration pages. Here's how the clickstream would proceed:
userstable associated with User #7812 ought to be deleted as well.
Scenario 2: Ira Angrywicz, User #3571, has developed a grudge against Herschel Mellowman, User #4189. In every discussion forum thread where Herschel has posted, Ira has posted a personal attack on Herschel right underneath. The procedure followed to deal with Sarah Moneylover is not appropriate here because Ira, prior to getting angry with Herschel, posted 600 useful discussion forum replies that we would be loathe to delete. The right tool to deal with this problem is an administration page showing all content contributed by User #3571 sorted by date. Underneath each content item's headline are the first 200 words of the body so that the administrator can evaluate without clicking down whether or not the message is anti-Herschel spam. Adjacent to each content item is a checkbox and at the bottom of all the content is a button marked "Disapprove all checked items." For every angry reply that Ira had to type, the administrator had to click the mouse only once on a checkbox, perhaps a 100:1 ratio between spammer effort and admin effort.
Scenario 3: A professional programmer hired to boost a company's
search engine rank writes scripts to insert content all around the
Internet with hyperlinks to his client's Web site. The programs are
sophisticated enough to work through the new user registration pages
in your community, registering 100 new accounts each with a unique
name and email address. The programmer has also set up robots to
respond to email address verification messages sent by your software.
Now you've got 100 new (fake) users each of whom has posted two
messages. If the programmer has been a bit sloppy, it is conceivable
that all of the user registrations and content were posted from the
same IP address in which case you could defend against this kind of
attack by adding an
originating_ip_address column to your
content management tables and building an admin page letting you view
and potentially delete all content from a particular IP address.
Discovering this problem after the fact, you might deal with it by
writing an admin page that would summarize the new user registrations
and contributions with a checkbox bulk-nuke capability to remove those
users and all of their content. After cleaning out the spam you'd
probably add a "verify that you're a human" step in the user
registration process in which, for example, a hard-to-read word was
obscured inside a patterned bitmap image and the would-be registrant
had to recognize the word amidst the noise and type it in. This would
prevent a robot from establishing 100 fake accounts.
No matter how carefully and intelligently programmed a public online community is to begin with, it will eventually fall prey to a new clever form of spam. Planning otherwise is like being an American circa 1950 when antibiotics, vaccines, and DDT were eliminating one dreaded disease after another. The optimistic new suburbanites never imagined that viruses would turn out to be smarter than human beings. Budget at least a few programmer days every six months to write new admin pages or other protections against new ideas in the world of spam.