Online Community Integration

a product/business idea by Philip Greenspun in October 2009
This is a product design ready for a team of young programmers to build. It is available free and without restriction.

After 40 years, among the most promising and popular uses of the Internet is education, especially informal unstructured education. People get answers with Google, learn from reading Wikipedia, and figure out how to accomplish tasks by watching videos on YouTube. The most dramatic effect of the Internet has been to expand the number of potential teachers. No longer are we restricted to learning from full-time teachers and full-length book authors. Someone who is an expert on how to shuck an oyster can make a two-minute video on the subject and help many a struggling chef at home.

One of the earliest and most effective means of connecting learners and teachers was the online community. Embodied first in mailing lists, then in discussion forums, and finally in comprehensive Web sites, the online community provides a way for people interested in sharing their expertise to answer questions. One of my favorite examples is this question about using red filters with black and white film (from photo.net, a community that the author developed in 1993).

For people to take maximum advantage of online communities, however, they need to participate in quite a few. Consider Joe Average, a suburban homeowner and parent, whose hobbies include taking pictures and videos of his kids and flying small airplanes. For Joe to take best advantage of opportunities for informal learning and teaching, he would have to belong to the following online communities:

Due to the flood of spam that has engulfed the Internet, each of these communities will involve a moderately cumbersome registration and authentication process. Joe may take the trouble to register and log in to each of these communities at the time when the community topic is first and foremost in his mind. He will, however, likely stop participating when other interests become more pressing. This represents a loss to other people around the world. For example, if Joe completes his instrument rating and stops visiting the "learn how to fly" forum, he won't be there to catch the "Who is a good instrument instructor at [Joe's local airport]?" question. Joe would have been delighted to answer this question if he had seen it, but he wasn't going to take the trouble to type in a username/password at each of 10 different communities each evening.

Lightweight Aggregation

This problem has been solved to some extent for people who simply wish to read information from multiple Web sources. A reader who wants to keep current with five Weblogs, three public discussion forums, and two newspapers can sign up for 10 RSS feeds and have a desktop or Web-based tool combine the information from those 10 sources on one page.
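
The combining step that such a reader performs can be sketched in a few lines. This is a minimal illustration, not a description of any particular news reader: the two sample feeds, the example.com links, and the function names are all made up, and the code assumes every item carries an RFC 822 pubDate.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical sample feeds; a real tool would fetch these over HTTP.
FEED_A = """<rss version="2.0"><channel><title>Weblog A</title>
<item><title>Older post</title><link>http://a.example/1</link>
<pubDate>Mon, 05 Oct 2009 10:00:00 GMT</pubDate></item>
<item><title>Newest post</title><link>http://a.example/2</link>
<pubDate>Wed, 07 Oct 2009 09:30:00 GMT</pubDate></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Forum B</title>
<item><title>Middle post</title><link>http://b.example/9</link>
<pubDate>Tue, 06 Oct 2009 18:15:00 GMT</pubDate></item>
</channel></rss>"""


def parse_rss_items(xml_text, source):
    """Return one dict per <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumes pubDate is present and well formed.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


def merge_feeds(feeds):
    """feeds maps a source name to RSS text; the newest items come first."""
    merged = []
    for source, xml_text in feeds.items():
        merged.extend(parse_rss_items(xml_text, source))
    return sorted(merged, key=lambda i: i["published"], reverse=True)


merged = merge_feeds({"Weblog A": FEED_A, "Forum B": FEED_B})
for entry in merged:
    print(entry["published"].date(), entry["source"], entry["title"])
```

Sorting by date is the whole trick here, which is also why this kind of tool stops scaling: nothing in the merge says which of the items is worth reading.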

There are several problems with news readers or RSS aggregators. First, they don't have very good tools for filtering and highlighting, so a person can't reasonably subscribe to too many sources (see "Solar Magnitude Forum" for an idea on how to give readers access to the most interesting parts of a very large discussion). Second, answering a question requires the potential teacher to (1) visit the underlying Web site, (2) remember the username/password for that Web site, (3) type in the username/password, (4) possibly navigate back to the question, and (5) post a response using an infrequently used and unfamiliar interface.

Heavyweight Aggregation

If discussions in a general area are fragmented among many disparate communities, why not bring them all together on one server, sort of the way that photo.net does for photography enthusiasts? Within photo.net forums are sections for Canon EOS users, Nikon users, nature photographers, wedding photographers, etc. A single login lets a wedding photographer using Canon EOS cameras participate in both relevant forums. (The "Solar Magnitude Forum" is supposed to be a fix for the inevitable fragmentation of community occasioned by dividing up into separate forums.)

Alternatively, instead of getting all of the world's biggest photo nerds on one server, get all of the world's Internet users on one server and then let them form whatever interest groups they like. This was the AOL strategy in the 1980s and 1990s and is the Facebook strategy now. The problem with this approach is that the enormous umbrella community never captures a large enough percentage of the experts needed and the software lacks whatever specialized tools are needed for a particular topic (e.g., a gardening community may need a collaboratively-maintained taxonomy off which to hang photos, discussion, articles, etc.).

Let's accept that the Internet has too many entrenched specialized online communities for Facebook groups to take over. A popular standalone community has active motivated participants who know each other by name and reputation. The archives may stretch back more than a decade. Advertising revenue may be enough to give the publisher a strong motivation to keep improving the service.

This business: Medium-weight Aggregation

At a minimum, in order to be a full participant in online learning communities, Joe Average needs (1) a system that will filter multiple communities down to the most relevant and interesting postings using his criteria, and (2) a way of responding to questions that is not substantially more difficult than answering a personal email.

This starts with some elements of lightweight aggregation. The canonical repositories for discussion remain on their existing sites. Maybe it is a standalone community such as gardenweb.com. Maybe it is a Google or Yahoo Group. Maybe it is a group within Facebook. With RSS feeds or screen-scraping scripts and using the participant's username/password credentials, the aggregator pulls discussion forum postings from the underlying communities, highlighting those that are likely to be most interesting and combining discussions from multiple sources on one page.
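
One way the highlighting step might work, sketched under assumptions: a posting is shown if the user's interest level in its community is high enough, or if it matches one of the user's saved filters regardless of interest level. The interest scale (1 = most interested, 5 = least) and the three filter kinds follow the data model at the end of this article; the class names, the threshold parameter, and the sample postings are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Posting:
    community: str
    author: str
    subject: str
    body: str


@dataclass
class UserFilter:
    """Exactly one of the three criteria is expected to be set."""
    community: str
    subject_regexp: Optional[str] = None
    anywhere_regexp: Optional[str] = None
    author: Optional[str] = None

    def matches(self, posting):
        if posting.community != self.community:
            return False
        if self.subject_regexp is not None:
            return re.search(self.subject_regexp, posting.subject,
                             re.IGNORECASE) is not None
        if self.anywhere_regexp is not None:
            text = "\n".join([posting.subject, posting.body, posting.author])
            return re.search(self.anywhere_regexp, text,
                             re.IGNORECASE) is not None
        return self.author is not None and posting.author == self.author


def select_postings(postings, interest_levels, filters, threshold=2):
    """Keep postings from high-interest communities plus any filter hits."""
    selected = []
    for p in postings:
        level = interest_levels.get(p.community, 5)  # unknown: least interested
        if level <= threshold or any(f.matches(p) for f in filters):
            selected.append(p)
    return selected


postings = [
    Posting("photo.net", "ann", "Red filter with Tri-X", "Which one?"),
    Posting("learn-to-fly", "bob", "Crosswind landings", "Any tips?"),
    Posting("learn-to-fly", "cat",
            "Good instrument instructor at my local airport?", "Thanks!"),
]
interest_levels = {"photo.net": 1, "learn-to-fly": 4}
filters = [UserFilter("learn-to-fly", subject_regexp=r"instrument instructor")]

shown = select_postings(postings, interest_levels, filters)
for p in shown:
    print(p.community, "|", p.subject)
```

With these inputs the flying forum is mostly hidden, but the instructor question still gets through, which is the case Joe would otherwise have missed.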

What if the participant, after reading a posting, wants to post a response? He or she can type it into a form on the aggregated page and the service will use his or her username/password to post it in the appropriate place on the underlying community. Suppose the participant wants to see something about the reputation of a poster or other contextual information? He or she clicks through to the underlying community, already conveniently logged in.
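
The posting path could be organized as one adapter per underlying community plus a lookup of the participant's stored credentials. The sketch below is only an outline of that shape: every class and method name is invented, and the "community" is an in-memory stand-in rather than a real site, since each real site would need its own login and posting code.

```python
from abc import ABC, abstractmethod


class CommunityAdapter(ABC):
    """Hypothetical interface that each per-community integration would implement."""

    @abstractmethod
    def post_reply(self, username, password, thread_id, text):
        ...


class InMemoryCommunity(CommunityAdapter):
    """Stand-in for a real site; keeps accounts and threads in dictionaries."""

    def __init__(self, accounts):
        self.accounts = accounts          # username -> password
        self.threads = {}                 # thread_id -> list of (username, text)

    def post_reply(self, username, password, thread_id, text):
        if self.accounts.get(username) != password:
            raise PermissionError("login rejected by underlying community")
        self.threads.setdefault(thread_id, []).append((username, text))


class Aggregator:
    def __init__(self):
        self.adapters = {}                # community name -> adapter
        self.credentials = {}             # (user_id, community) -> (username, password)

    def reply(self, user_id, community, thread_id, text):
        """Post on the user's behalf using the credentials stored for that community."""
        username, password = self.credentials[(user_id, community)]
        self.adapters[community].post_reply(username, password, thread_id, text)


forum = InMemoryCommunity({"joe_avg": "s3cret"})
service = Aggregator()
service.adapters["learn-to-fly"] = forum
service.credentials[(1, "learn-to-fly")] = ("joe_avg", "s3cret")
service.reply(1, "learn-to-fly", "t42", "I can recommend my own instructor.")
print(forum.threads["t42"])
```

Keeping the credentials keyed by (user, community) means the form on the aggregated page needs to carry only the community and thread identifiers.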

Because we've already got all of the text of discussion forums from hundreds or thousands of online communities, it is easy to make a unified mobile phone interface to all of them. Even if the community software hasn't been touched since 1996 and no thought was given to viewing/posting on a 3-inch screen, the discussion will now be usable from a smartphone.

As long as we're integrating communities we can add the best features of the best community software to every community. For example, on photo.net in the late 1990s, as the community grew beyond 100,000 registered users, we added the ability for a reader to tag a forum contributor as an "interesting person". This enabled the system to show the reader a page of all new content from all members that he or she had previously tagged as thoughtful or skilled.
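
As a rough illustration of the "interesting person" page (not the photo.net implementation, which is not shown here), the selection amounts to filtering recent postings by a per-reader set of tagged authors. The field names and sample data are assumptions.

```python
from datetime import datetime


def new_content_from_tagged(postings, tagged_authors, since):
    """Return postings newer than `since` by tagged authors, newest first."""
    hits = [p for p in postings
            if p["author"] in tagged_authors and p["posted"] > since]
    return sorted(hits, key=lambda p: p["posted"], reverse=True)


# Hypothetical postings gathered from several communities.
postings = [
    {"author": "ann", "subject": "Red filters and Tri-X",
     "posted": datetime(2009, 10, 5, 9, 0)},
    {"author": "bob", "subject": "Which tripod?",
     "posted": datetime(2009, 10, 6, 12, 0)},
    {"author": "ann", "subject": "Old darkroom notes",
     "posted": datetime(2009, 9, 1, 8, 0)},
    {"author": "cy", "subject": "Wedding lens kit",
     "posted": datetime(2009, 10, 7, 7, 30)},
]

tagged = {"ann", "cy"}            # authors this reader marked as interesting
page = new_content_from_tagged(postings, tagged, since=datetime(2009, 10, 1))
for p in page:
    print(p["posted"], p["author"], p["subject"])
```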

We can track all questions posted and provide email notifications of responses, as well as highlighting threads that were initiated by or participated in by our customer.

Given that the server is pulling information from thousands of online communities on a daily basis, it should be easy to support a directory of active communities where a question is likely to be answered. Try searching for "aquarium online community" or "aquarium forum". Then sort through the sites that Google returns and see how long it takes to determine which forums are active and have a good percentage of questions adequately answered. Marc Smith and his group at Microsoft Research were able to develop quite a few metrics of online community quality. These can be computed automatically and used to significantly assist a user in finding a helpful discussion forum.
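
To make "computed automatically" concrete, here are two deliberately simple measures: how many threads were started recently, and what fraction of them drew at least one reply. These are illustrative stand-ins chosen for this sketch, not the metrics developed at Microsoft Research, and the thread data and community names are invented.

```python
from datetime import date, timedelta


def community_metrics(threads, today, window_days=30):
    """Summarize recent activity for one community."""
    cutoff = today - timedelta(days=window_days)
    recent = [t for t in threads if t["posted"] >= cutoff]
    answered = [t for t in recent if t["replies"] > 0]
    return {
        "recent_threads": len(recent),
        "answered_fraction": len(answered) / len(recent) if recent else 0.0,
    }


def rank_communities(all_threads, today):
    """Order community names by answered fraction, then by recent volume."""
    scored = {name: community_metrics(threads, today)
              for name, threads in all_threads.items()}
    return sorted(
        scored,
        key=lambda n: (scored[n]["answered_fraction"],
                       scored[n]["recent_threads"]),
        reverse=True,
    )


today = date(2009, 10, 31)
all_threads = {
    "active-reef": [
        {"posted": date(2009, 10, 20), "replies": 3},
        {"posted": date(2009, 10, 25), "replies": 1},
        {"posted": date(2009, 10, 28), "replies": 0},
    ],
    "ghost-tank": [
        {"posted": date(2008, 1, 1), "replies": 5},    # too old to count
        {"posted": date(2009, 10, 10), "replies": 0},
    ],
}

ranking = rank_communities(all_threads, today)
print(ranking)
print(community_metrics(all_threads["active-reef"], today))
```

A reply count says nothing about whether the answer was adequate, so a real directory would need something better than this for the "adequately answered" part.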

How does it add value for users?

This service adds value to users in the following ways:

How does it make money?

Ads. The reader is looking at content from underlying sites but doing so on a page served by the aggregator's computer. This gives the builder of the aggregation/integration service the opportunity to sell advertisements. The underlying sites will still get substantial ad revenue because (1) the more content that is submitted, the more pages that they have that could conceivably carry an ad, and (2) readers using an aggregation service will still sometimes click through to the community directly, perhaps more frequently than if they did not have access to the service. Getting more people to participate more often in online communities should grow the pie so that everyone is better off: more content, more page views, more advertising revenue.

Suppose that a publisher of one of the underlying communities does not see things this way and begins to block the IP address of a central aggregation server. Nothing stops all of this software from running inside a browser on a reader's desktop machine, in which case the page requests should not be distinguishable from ordinary browsing and therefore won't be blockable.

If the interface is a sufficient improvement over that of the underlying sites and the service catches on, the potential for profits is enormous because the business need not invest in content. Every new user adds a tiny bit of cost in terms of servers and programming but adds a lot of page views that can support advertising.

How does it make more money?

Marketing departments at big companies. Quite a few companies currently pay $100,000 to $500,000 annually to services such as Crimson Hexagon to find out what consumers are saying about their products in "the social web". Some of these tools provide a time-saving means for these companies to respond to postings, either honestly or dishonestly. For example, Bosch might pay to see all of the discussions where people mention their dishwashers. If someone says "my Bosch failed after two weeks" they can post a response "Just email me and I will make sure it gets fixed -- Bosch Customer Care" or, fraudulently, "I'm a homeowner in Cleveland and my Bosch dishwasher has worked flawless for 15 years. I love the sparkly shine."

If we can come up with a set of features that are mostly valuable to companies trying to manage their brand image and reputation, it should be possible to charge $10,000 or more annually for them to use essentially the same service that consumers are using for free.

How does one build it?

A "mere matter of programming" as the great Jin S. Choi has been known to say. The only non-standard challenge presented by this project is the need to develop screen-scraping regular expressions for individual online communities that are run on custom software. This can be done on-demand. For example, when the first saltwater fish tank nerd subscribes and wants to participate in reefcentral.com, the programmers can scramble to get it hooked up. A lot of comparison shopping services depend on similar methods, oftentimes using small teams of programmers in lower wage countries.
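
For a sense of what one of these per-community patterns looks like, here is a small sketch. The HTML is a made-up thread listing, not the markup of reefcentral.com or any other real site, and the pattern would have to be rewritten for each custom-coded community.

```python
import re
from html import unescape

# Hypothetical thread-list markup from a custom-coded forum.
SAMPLE_HTML = """
<tr class="thread"><td><a href="/t/101">Clownfish not eating</a></td><td class="by">reefgal</td></tr>
<tr class="thread"><td><a href="/t/102">Skimmer &amp; sump sizing</a></td><td class="by">tankdad</td></tr>
"""

# One pattern per community; named groups keep the calling code uniform.
THREAD_PATTERN = re.compile(
    r'<tr class="thread"><td><a href="(?P<url>[^"]+)">(?P<subject>[^<]+)</a></td>'
    r'<td class="by">(?P<author>[^<]+)</td></tr>'
)


def scrape_threads(html_text, pattern=THREAD_PATTERN):
    """Return a list of {url, subject, author} dicts, with HTML entities decoded."""
    return [
        {key: unescape(value) for key, value in match.groupdict().items()}
        for match in pattern.finditer(html_text)
    ]


threads = scrape_threads(SAMPLE_HTML)
for t in threads:
    print(t["url"], "|", t["subject"], "|", t["author"])
```

Patterns like this break whenever the site changes its templates, which is part of why the maintenance work lends itself to a standing team rather than a one-time effort.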

Importance of Security

Add something extra for firewalls and security. This system will have custody of important username/password pairs and must be trusted and trustworthy. An unresolved problem is that a user of Gmail and Google Groups, for example, may be using the same username/password for both. He or she either must trust this new service sufficiently to store the email account password or create a new Google ID for use with Google Groups.

Can we do this with Google Groups or Yahoo Groups?

In accordance with the process outlined in "Software Design Review", before we rush off to build new software we should ask the question "Can we do this with existing software or services?"

The most obvious candidates are Google Groups and Yahoo Groups. These services allow anyone to create a simple online community. With a single username/password, which may already be what one is using to read email, a participant can read and answer questions in multiple communities. Neither Google nor Yahoo seems interested in integration, however. Neither service allows a participant to view threads from multiple groups on a single page. To see what's going on inside 5 groups, the user has to click down into Group 1, look at a list of discussion topics, then click back up to the top level, click down into Group 2, look at a list of topics, etc. Nor does either service allow the importation of discussion threads via RSS from communities hosted elsewhere.

Who has tried this before?

Microsoft attacked this problem from a slightly different angle, with its Passport service. Every Web community in the world would pay Microsoft a fee to authenticate their users through Passport, later renamed Windows Live ID. A user would pick a name/password once on a Microsoft server and be able to use that everywhere on the public Internet. OpenID is a newer version of the same idea, this time not controlled by a single for-profit corporation.

A single sign-on system such as Microsoft Passport or OpenID solves the problem of remembering a lot of different usernames and passwords, but it doesn't solve the problem that a participant in 20 different forums will have to click the mouse at least 40 times to see what's new and whether there is anything worth responding to.

Conclusion

This would be cheap to build and, given that it is entirely text, cheap to host. It would save users a lot of time and streamline their lives. It could make the publisher/owner a lot of money if the service becomes popular.

Why I posted this

I'm too busy with some other projects right now to finance and manage this business, so I'm making the product plan available to any young energetic team worldwide that would like to do it. Good luck and send me the URL so that I can be your first user!

About the Author

Philip Greenspun was an early developer of software to support Web-based online communities, starting in the early 1990s, and was an early proponent of using the relational database management system as the backend source of persistence. He wrote the chapter "Scalable Systems for Online Communities" in the late 1990s and a textbook on developing Internet applications, published in 2006 by MIT Press.

More: resume.

Cesar Brea, a former management consultant at Bain and Monitor and now head of Force Five Partners, threw a few buckets of cold salt water on an earlier draft of this document.

Estimated Costs and Revenue

Based upon the author's experience with Google Ads in discussion forum pages, revenue should be approximately $2 per 1000 page views. Using Google Ads per se may be slightly challenging because the service is designed for static Web pages that it can read in advance. The Google infrastructure is capable of producing relevant ads in response to real-time updates, as it does for Gmail, but this capability is not exposed to other publishers.

The typical active user of online communities may visit 60 or 70 pages per day (each page being one discussion forum thread), which makes the potential revenue per user about $50 per year. If the aggregation service grows to the size of a single moderately popular online community, about 100,000 active users, that could be $5 million per year in revenue.
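
The arithmetic behind those figures, written out. The only number not taken from the text is the choice of 65 pages per day as the midpoint of "60 or 70"; the variable names are of course made up.

```python
REVENUE_PER_1000_VIEWS = 2.00   # dollars, from the Google Ads estimate
PAGES_PER_DAY = 65              # assumed midpoint of "60 or 70"
ACTIVE_USERS = 100_000


def annual_revenue_per_user(pages_per_day, revenue_per_1000, days=365):
    """Dollars per user per year at a flat rate per 1000 page views."""
    return pages_per_day * days * revenue_per_1000 / 1000


per_user = annual_revenue_per_user(PAGES_PER_DAY, REVENUE_PER_1000_VIEWS)
total = per_user * ACTIVE_USERS

print(f"per user per year: ${per_user:.2f}")
print(f"100,000 users:     ${total:,.0f}")
```

At 65 pages per day this comes to a little over $47 per user and about $4.7 million for the service; across the 60 to 70 range the per-user figure runs from roughly $44 to $51, which is where the rounded "$50" and "$5 million" come from.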

What about costs? Because the service need not crawl the Web or store any information from the underlying communities, the hardware and software infrastructure required will be modest. Given U.S.-based programming, system administration, and hosting, a reasonable long-term budget for technology is probably around $1 million per year. About half of that would be spent on maintenance and security. The other half would be spent on developing new features and interfacing to new custom-coded communities. At least a small offshore labor group would be required so that the cost of interfacing to a new community does not exceed the revenue derived from it.

For a credible out-of-the-gate start, the service should probably interface to at least the most popular 1000 online communities (a big service such as Google Groups, Facebook, or Twitter would count as 1). Let's assume 1000 programmer-days to build those interfaces (some will be tough, but some will be easy due to the use of similar software), at an offshore labor rate of $150 per day. That's $150,000 in startup costs to have the most popular communities set up and ready to go. Add to that another $50,000 to build the core service and we're talking about $200,000 in technical startup costs.
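
The same numbers as a short calculation, extended one step to show when a single community interface pays for itself. The break-even figure is derived here rather than stated in the text: it assumes the average of one programmer-day per community implied above and the $2 per 1000 page views estimate, and it ignores ongoing maintenance.

```python
COMMUNITIES = 1000
PROGRAMMER_DAYS = 1000
DAY_RATE = 150                  # dollars per offshore programmer-day
CORE_SERVICE_COST = 50_000      # dollars
REVENUE_PER_1000_VIEWS = 2      # dollars

interface_cost = PROGRAMMER_DAYS * DAY_RATE
startup_cost = interface_cost + CORE_SERVICE_COST

# Average cost of hooking up one community, and the page views
# needed from that community to earn it back in ad revenue.
cost_per_community = interface_cost / COMMUNITIES
break_even_views = cost_per_community / REVENUE_PER_1000_VIEWS * 1000

print(f"interfaces:         ${interface_cost:,}")
print(f"technical startup:  ${startup_cost:,}")
print(f"per community:      ${cost_per_community:,.0f}")
print(f"break-even views:   {break_even_views:,.0f}")
```

On these assumptions an interface that takes one day to build needs about 75,000 page views to cover its cost, or roughly three user-years of reading at 65 pages per day, so obscure communities with only a handful of subscribers are where labor cost matters most.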

As far as hosting goes, this one rather cries out for cloud computing. Customers will have more faith in a cloud computing vendor's ability to provide security than they will in the average ISP's. There is no advantage to having all of the customers on one big computer. This is a more or less personal service and there is no downside to each customer having a dynamically assigned personal server. The nice thing about cloud computing in this case is that the hardware/hosting costs will scale with customers and usage and also that the system could handle a big spike in usage if it became popular.

Marketing methods and costs are tougher to predict. Though I have some ideas of my own, I'm going to leave that as an exercise for the entrepreneur.

Exit Strategy? What if this were funded as a venture capital-backed business? How could the original investors get their cash back? This company would be a natural to sell to any of the big media companies that are good at wringing the last advertising dime out of a page view. Examples include Demand Media, Marchex, and NameMedia.

Free Software

In case you're too lazy to get started, here is the beginning of an SQL data model for the service... (Oracle syntax)

-- Multi-community data model, by philg@mit.edu, October 2009
-- available under the GNU General Public License 

-- create a sequence for user_id
create sequence user_id_sequence start with 1; 

create table users (
    user_id integer primary key,
    first_names varchar(50),
    last_name varchar(50) not null,
    email varchar(100) not null unique,
    -- we encrypt passwords using operating system crypt function
    password varchar(30) not null,
    registration_date timestamp(0)
);

-- multiple users might belong to photo.net or facebook, for example, so we represent 
-- everything common about one of those sites in this table

-- create a sequence for community_id
create sequence community_id_sequence start with 1; 

create table communities (
    community_id        integer primary key,
    community_name      varchar(4000) not null,
    community_url       varchar(200) not null, -- the home page, e.g., http://photo.net
    -- here's where the real engineering happens; we need to store all of the code and patterns 
    -- necessary to log into a particular community
    -- perhaps our life is easy and this uses a standard toolkit such as Wordpress
    -- (the standard_toolkits table, defined below, must be created before this one)
    standard_toolkit_id integer references standard_toolkits -- note that column may be NULL if the community is custom-coded
);

-- this next table doesn't really need its own generated key, but it
-- probably makes life easier if using Web development tools that expect 
-- to see a single-column key

create sequence uc_map_id_sequence start with 1;

create table user_community_map (
    uc_map_id          integer primary key,
    -- the next two columns, taken together, are the real key (index below enforces their key-ness)
    user_id            integer not null references users,
    community_id       integer not null references communities,
    username           varchar(200),
    password           varchar(200),
    cookie             varchar(4000), -- this way we won't have to keep logging in
    -- interest level may vary with season, e.g., user will be very interested in skiing community in November, but reduce level in March
    interest_level     integer default 2 -- from 1 (most interested) to 5 (least), how much stuff does our user want to see from this community?
);

create unique index uc_map_key_idx on user_community_map (user_id, community_id);

-- here we store things that a user wants to pick out from a community even if his interest level is low at the time
-- should we consider a skinny table architecture instead?  With a pattern column and a search_type column?

create sequence user_filter_id_sequence start with 1;

create table user_filters (
    filter_id        integer primary key,
    uc_map_id        integer not null references user_community_map,
    -- one of the following should not be NULL; the rest will be
    -- (the varchar lengths here are guesses)
    -- look for a regexp in the subject line of a discussion forum posting
    subject_regexp   varchar(4000),
    -- look for a regexp anywhere in posting (subject, body, author)
    anywhere_regexp  varchar(4000),
    -- look for a particular author
    author           varchar(200)
);

-- the unique index uc_map_key_idx above makes it fast to ask "to which communities does user 678 belong"; now an index to make it fast to ask
-- "which of our users belong to community 342?"

create index user_community_map_by_cu on user_community_map ( community_id, user_id );

create sequence standard_toolkit_id_sequence start with 1;

create table standard_toolkits (
    toolkit_id        integer primary key,
    toolkit_name      varchar(100) not null
    -- we'll need a lot more here!
);
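
To try the two lookups that the indices above are meant to speed up, here is a small sketch that loads a trimmed version of the model into an in-memory SQLite database. It is an adaptation, not the Oracle schema itself: sequences are dropped in favor of SQLite's automatic integer keys, several columns are omitted, and the sample rows, example.com addresses, and function names are invented.

```python
import sqlite3

SCHEMA = """
create table users (
    user_id     integer primary key,
    first_names varchar(50),
    last_name   varchar(50) not null,
    email       varchar(100) not null unique
);
create table communities (
    community_id   integer primary key,
    community_name varchar(4000) not null,
    community_url  varchar(200) not null
);
create table user_community_map (
    uc_map_id      integer primary key,
    user_id        integer not null references users,
    community_id   integer not null references communities,
    interest_level integer default 2
);
create unique index uc_map_key_idx on user_community_map (user_id, community_id);
create index user_community_map_by_cu on user_community_map (community_id, user_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "insert into users (user_id, first_names, last_name, email) values (?, ?, ?, ?)",
    [(1, "Joe", "Average", "joe@example.com"),
     (2, "Jane", "Pilot", "jane@example.com")],
)
conn.executemany(
    "insert into communities (community_id, community_name, community_url) values (?, ?, ?)",
    [(1, "photo.net", "http://photo.net"),
     (2, "learn how to fly", "http://flying.example.com")],
)
conn.executemany(
    "insert into user_community_map (user_id, community_id, interest_level) values (?, ?, ?)",
    [(1, 1, 1), (1, 2, 4), (2, 2, 1)],
)


def communities_for_user(conn, user_id):
    """Community names for one user, most interesting first."""
    rows = conn.execute(
        """select c.community_name
             from communities c
             join user_community_map m on m.community_id = c.community_id
            where m.user_id = ?
            order by m.interest_level, c.community_name""",
        (user_id,),
    )
    return [row[0] for row in rows]


def users_in_community(conn, community_id):
    """Email addresses of our users who belong to one community."""
    rows = conn.execute(
        """select u.email
             from users u
             join user_community_map m on m.user_id = u.user_id
            where m.community_id = ?
            order by u.email""",
        (community_id,),
    )
    return [row[0] for row in rows]


print(communities_for_user(conn, 1))
print(users_in_community(conn, 2))
```

The unique index also does the job the comments assign to it: inserting a second row for the same user and community is rejected.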


Text and photos copyright 2009 Philip Greenspun.
philg@mit.edu

Reader's Comments

Philip, I think it is a very good idea. I wonder if statistical linguistic analysis would be a practical and productive adjunct to manual, iterative regexp coding.

Bayesian spam filters like DSPAM have been developed to learn very effectively for their intended purpose, but their capabilities have the potential to extend beyond the realm of SPAM filtration. If the guts of a spam filter were to be deployed against the 'scraped' content, it should be able to identify the patterns relevant to the subscriber with a high and improving degree of reliability. What's more, the software would be responsive to user feedback - it would learn what mattered to the subscriber.

This might allow the software to develop and continually improve independently of programming hours invested.

Regards Richard

-- Richard Hamilton, December 1, 2009

It's a great idea, and I know users would love it. But I have two issues with it.

1.) It takes away the revenue stream from content creators, who are depending on eyeballs of viewers to see their ads, not yours. There could be copyright issues, but regardless, it disincentivizes the content creators or aggregators.

2.) If it works for me, it will work for a spammer. And then it won't work at all. People will demand forums where this isn't allowed to reduce the amount of spam (and drunken tweets that are familiar)

-- Aaron Evans, August 25, 2010
