How long is the average Internet discussion forum posting?

A friend of mine who works at a database management system company asked for thoughts on how long a string a database table needs to be able to store, as a practical matter, to serve most Internet programming needs.  This prompted me to do some queries into the photo.net discussion forum.  Here’s my message to him, which I thought would be interesting to nerd readers….


Three basic issues for Web development relating to varchar/clob:


1) Strings are uploaded from a browser-rendered TEXTAREA are of a length that is impossible for the programmer to predict. In a sense, then, every text slot in the database must be prepared to accept a string of arbitrary length.

http://www.cs.tut.fi/~jkorpela/forms/textarea.html#browserlimits reveals that some browsers have limits of 32K or 64K but that as Microsoft and Mozilla get more sophisticated these seem to be disappearing.

2) Software developers of Internet applications are often first-time SQL programmers and sometimes first-time programmers altogether. Unless a DBMS can make CLOBs work with every SQL function and command these novice programmers must learn a whole new computer language, essentially, to deal with CLOBs.


3) Internet applications are often developed using feeble ad hoc tools, such as PHP (my students in 6.171, MIT’s Software Engineering for Internet Applications, mostly picked PHP to do their semester project even though I would have discouraged this, being a mistruster of thrown-together unnecessary new languages). Many of these tools don’t have facilities for dealing with anything beyond the basic SQL data types so they couldn’t use CLOBs if they wanted to.


I think for Web development it is reasonable to expect the average string length from the user to be 300 chars, despite wanting to be prepared for a maximum of 32K or even larger. Oops. Typing that prompted me to do the query (see below). Averaging 2 million messages on photo.net, the correct number is 425. The histogram query reveals that out of 2 million messages over a 10-year period a 32K limit would have resulted in 6 messages being rejected and a 16K limit something like 30 rejections.


If you wanted to implement something like Salon.com as a single RDBMS table for both articles and comments on articles I think a 64K limit might be required. If someone authors a 5000-word magazine article in Microsoft Word and then saves as HTML that will be 25-30k of content plus at least a factor of 2 in HTML tags and other Microsoft-added filler.

http://www.photo.net/bboard/q-and-a-fetch-msg?msg_id=002oFh is an example of one of the big postings on photo.net. It is 37962 chars long, the HTML is very clean (i.e., much less filler than if saved by Word), and yet the page doesn’t seem excessively long.

So… my conclusion from looking at the queries below is that 32K would do the job for a pure discussion forum system and that it would be marginal for storing articles unless a publisher decided that everything should be broken up into “part 1”, “part 2”, and “part 3”. Looking at the .html files on photo.net, the vast majority are in fact under 32K and most are under 64K.


However, if you look at

http://philip.greenspun.com/seia/ the largest chapters are 88.7k.

If you want to facilitate novice programmers building full-scale content management systems where all the content is uploaded from browsers it might be necessary to make the varchar datatype bloat up to 100k-ish. But 32K would be adequate to build something like eBay, amazon (user-uploaded content and much of the publisher content as well), or photo.net discussion forums.


———– some stats from photo.net


select avg(dbms_lob.getlength(MESSAGE)),count(*) from bboard;


AVG(DBMS_LOB.GETLENGTH(MESSAGE))   COUNT(*)
——————————– ———-
        424.672669    2052290


select round(dbms_lob.getlength(MESSAGE),-3), count(*)
from bboard
group by round(dbms_lob.getlength(MESSAGE),-3);



ROUND(DBMS_LOB.GETLENGTH(MESSAGE),-3) COUNT(*)
————————————- ———-
        0  1452035
     1000   510236
     2000    58330
     3000    11972
     4000     3399
     5000     1303
     6000      481
     7000      264
     8000      138
     9000       93
    10000       66


    11000       38
    12000       22
    13000       21
    14000       17
    15000        6
    16000       12
    17000        9
    18000       14
    19000        4
    20000        4
    21000        1

    22000        2
    23000        1
    24000        4
    25000        3
    26000        2
    27000        1
    30000        2
    32000        1
    33000        2
    34000        1
    37000        1

    38000        1

Note:  Oracle did a much better job formatting these in SQL*Plus; for some reason the tabs didn’t carry through after cutting and pasting.


——————————


Epilogue (not from my email to the friend):


Look how much fun it is to program SQL.  Three lines of code and you get an interesting answer (and those three lines would have been much cleaner and simpler if we hadn’t been forced to use the CLOB datatype, which has its own strange accessor functions).  Compare to Java and C where typing until your fingers fall off usually doesn’t result in much of anything.  SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.

Full post, including comments