Discussion Forum Server Specification
for "Software Engineering of Internet Applications" at MIT
We will conduct an experiment to figure out whether "real name"
discussion forums are more useful to participants than traditional
mostly anonymous discussion forums. We will use the legacy users of the
legacy site www.greenspun.com
(the /bboard section) to quickly get a significant sample size.
The Basic Idea
We will operate a server where anyone can come to establish a discussion
forum. The person who sets up a forum is called the "publisher." The
publisher can decide whether a forum is public or private and whether
the forum will be a "screen name forum" in which users are identified by
whatever name they choose to type in, plus perhaps an email address
(could be "samoyedlover@hotmail.com" and therefore does not identify
anyone) or a "real name forum" in which users are identified by full
name and their city of residence. In the real name forum, users are
authenticated by a small credit card charge or by a referral/approval
from an existing real name member.
We test whether or not a discussion group is effective with some of the metrics
developed by the community research group at Microsoft
(http://research.microsoft.com/community/)
and their flagship system http://netscan.research.microsoft.com/.
Are questions getting answered to the original poster's satisfaction?
Is abusive language being used? Are discussions deteriorating into
flame wars?
We write an academic journal paper summarizing our findings. Our
hypothesis is that the real name, identified and authenticated
communities will be much more effective for participants.
Practicalities
We transfer all of the legacy content from greenspun.com and continue to
serve it at the same URLs (/bboard) so that we don't break links from Google and
other places around the Internet. On every page, we invite people to
start a new discussion forum at /groups using the new software. We also
spam out email invitations to all of the registered publishers on
greenspun.com telling them about the new improved service.
Features Desired
First and foremost we need powerful anti-spam defenses. When
registering, a user should have to decode a word in one of those
hard-to-read GIFs. Publishers should have the option that all content
must be approved before going live as well as "delete all content from
this IP address" or "delete all content from this user" or "delete all
content containing the following string" (with an "are you sure you want
to do this" if the result is more than 10 messages). The server should
be able to distinguish between trusted (has posted several messages that
have been approved) and untrusted (new or disapproved) users. There should be
an option to let postings from trusted users go live immediately.
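The moderation rules in the paragraph above can be sketched roughly as follows. This is only an illustration, not a design decision: the class and function names, and the value of TRUST_THRESHOLD (the spec only says "several" approved messages), are assumptions. The confirmation threshold of 10 comes from the text.

```python
from dataclasses import dataclass

TRUST_THRESHOLD = 3     # assumption: the spec says only "several" approved messages
CONFIRM_THRESHOLD = 10  # from the spec: ask "are you sure" above 10 messages


@dataclass
class User:
    name: str
    approved_count: int = 0
    disapproved: bool = False

    def is_trusted(self) -> bool:
        # Trusted = enough approved postings and never disapproved.
        return not self.disapproved and self.approved_count >= TRUST_THRESHOLD


@dataclass
class Message:
    author: User
    ip: str
    body: str
    live: bool = False


def submit(msg: Message, approval_required: bool, trusted_go_live: bool) -> bool:
    """Set and return whether a new posting goes live immediately."""
    if not approval_required:
        msg.live = True
    elif trusted_go_live and msg.author.is_trusted():
        msg.live = True
    else:
        msg.live = False  # waits in the publisher's approval queue
    return msg.live


def bulk_delete(messages, predicate, confirmed=False):
    """Delete every message matching predicate.

    Returns (remaining_messages, deleted_count, needs_confirmation).
    Nothing is deleted when more than CONFIRM_THRESHOLD messages match
    and the publisher has not yet confirmed.
    """
    matches = [m for m in messages if predicate(m)]
    if len(matches) > CONFIRM_THRESHOLD and not confirmed:
        return messages, 0, True
    remaining = [m for m in messages if not predicate(m)]
    return remaining, len(matches), False
```

The three publisher actions ("from this IP address", "from this user", "containing the following string") then become three predicates, for example `lambda m: m.ip == "9.9.9.9"` or `lambda m: "pills" in m.body`.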
The discussion interface should be very clean and simple, which is what
attracted all those publishers and users to greenspun.com/bboard in the
first place (they could have used Yahoo! Groups, but they chose not to).
Only one format for discussions: question and answer. No option for
threading.
Categorization tools for users and publishers so that postings can be
categorized.
Soft deletion only, except for spam. Bad content is moderated so far
down that it is almost impossible to find, but it is still there in the
database.
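One possible reading of "soft deletion", sketched under assumptions: each posting carries a moderation score, bad content is pushed far down rather than removed, and only spam is physically deleted. The field names and the penalty value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Posting:
    body: str
    score: int = 0        # hypothetical moderation score; higher sorts first
    is_spam: bool = False


def moderate_down(posting: Posting, penalty: int = 1000) -> None:
    """Soft delete: bury the posting, but keep the row in the database."""
    posting.score -= penalty


def purge_spam(postings):
    """Hard delete: spam is the one case where rows are really removed."""
    return [p for p in postings if not p.is_spam]


def display_order(postings):
    """Show the highest-scored postings first; buried ones sink to the end."""
    return sorted(postings, key=lambda p: p.score, reverse=True)
```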
Click on a person's name and you see all of their contributions.
Email spamming system for the publisher.
Discussion forum is available via RSS.
System for figuring out if a forum is not being actively moderated, in
which case it must be shut down because it becomes a magnet for spam.
Some Nice-to-Have Features
- ability to attach files to discussion forum postings
- customer service system where the labor can be hired in India and
the user asking for service is required to make an online donation to
www.sarasanctuary.org (I will pay the guy in India out of my own pocket,
but I want people to have some incentive to contact the
publisher/moderators before they contact me) -- a typical user request
is "I posted this thing 5 years ago and now I want it removed" (and
maybe the publisher/moderator has died or given up)
Miscellaneous
There is some legacy static content on the site belonging to my brother.
We get rid of this and give him and his kid a WordPress weblog (do we
need to install more than one?).
Tasks
- export Oracle from legacy www.greenspun.com server and import into
PostgreSQL (use Perl, maybe
http://www.samse.fr/GPL/ora2pg/ora2pg.html#description)
- contact Marc Smith and Peter Kollock to get metric suggestions
(assigned: Philip; email sent April 8; need to follow up by phone)
- design user registration and discussion data models
- design real name verification systems
- experiment with real name verification via credit cards
(https://www.paypal.com/IntegrationCenter/ic_home.html
describes what PayPal can do; a key document might be the API Reference)
- spec customer service system (data model, page flow)
- contact people in India to get leads on good workers (assigned:
Philip; last email to guy in Delhi, April 8)
- consider a method for users to flag off-topic content, look at
Craigslist (try to keep very simple) -- should be installed at the
option of the group owner
- research full-text search for private forums (tsearch2 ?) and use Google
search for public forums (assigned: Brian; since he has to do it for
6.171 project anyway; use same mechanism as for ECAC)
- migrate Harry and Benjamin's content to either WordPress or Drupal
(assigned: Philip and Shimon; decisions... WordPress is okay. One blog,
both Harry and Benjamin able to post; implies we need to install MySQL)
- get email forwarding set up
- figure out how much bandwidth greenspun.com is using and find a
long-term host (textdrive? assigned: Shimon to analyze one day of
traffic; DONE; result was 1GB of server log traffic on a Tuesday, which
translates into nearly 2 GB of "hosting traffic" because of some
overhead stuff the ISP adds)
- figure out the optimum block size for the production Postgres. 8K
is the default, but that was set in the days when a server might have
had a 100 MB cache at most. We can probably have a 1 GB cache and we
probably don't want the database to be working that hard to break up
longer user writings. So maybe a 32K block size makes sense given that
the average bboard row might be 1K bytes in size. The disadvantage of a
large block size is that you pull in a lot of unrelated stuff when you
touch one row, but memory is getting ever cheaper.
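- Philip and Shimon have determined that via the TOAST mechanism,
Postgres can store fields far larger than one block (up to 1 GB per
field value), so long postings do not by themselves require a larger
block size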
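The block-size reasoning in the task list above can be checked with some back-of-envelope arithmetic. This sketch uses the figures from the text (8K and 32K blocks, a 1K average bboard row, 100 MB and 1 GB caches) and ignores per-page and per-row overhead, so the real numbers would be somewhat lower. The function names are made up for illustration.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

AVG_ROW = 1 * KB  # from the text: average bboard row might be 1K bytes


def rows_per_block(block_size: int, avg_row_size: int = AVG_ROW) -> int:
    """Rows pulled into memory when one block is read (overhead ignored)."""
    return block_size // avg_row_size


def blocks_in_cache(cache_size: int, block_size: int) -> int:
    """How many distinct blocks fit in the cache at once."""
    return cache_size // block_size


for block_size in (8 * KB, 32 * KB):
    for cache_size in (100 * MB, 1 * GB):
        print(
            f"block={block_size // KB}K cache={cache_size // MB}MB: "
            f"{rows_per_block(block_size)} rows/block, "
            f"{blocks_in_cache(cache_size, block_size)} blocks cached"
        )
```

With a 1 GB cache, 32K blocks still leave room for 32768 distinct blocks, more than the 12800 8K blocks that fit in an old 100 MB cache. The cost is that touching one row drags in about 32 rows instead of 8. Note that the PostgreSQL block size is fixed when the server is built, not set in a runtime configuration file, so trying 32K means a custom build.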
philg@mit.edu