How long is the average Internet discussion forum posting?

A friend of mine who works at a database management system company asked for thoughts on how long a string a database table needs to be able to store, as a practical matter, to serve most Internet programming needs.  This prompted me to do some queries into the photo.net discussion forum.  Here’s my message to him, which I thought would be interesting to nerd readers….


Three basic issues for Web development relating to varchar/clob:


1) Strings are uploaded from a browser-rendered TEXTAREA are of a length that is impossible for the programmer to predict. In a sense, then, every text slot in the database must be prepared to accept a string of arbitrary length. http://www.cs.tut.fi/~jkorpela/forms/textarea.html#browserlimits reveals that some browsers have limits of 32K or 64K but that as Microsoft and Mozilla get more sophisticated these seem to be disappearing.


2) Software developers of Internet applications are often first-time SQL programmers and sometimes first-time programmers altogether. Unless a DBMS can make CLOBs work with every SQL function and command these novice programmers must learn a whole new computer language, essentially, to deal with CLOBs.


3) Internet applications are often developed using feeble ad hoc tools, such as PHP (my students in 6.171, MIT’s Software Engineering for Internet Applications, mostly picked PHP to do their semester project even though I would have discouraged this, being a mistruster of thrown-together unnecessary new languages). Many of these tools don’t have facilities for dealing with anything beyond the basic SQL data types so they couldn’t use CLOBs if they wanted to.


I think for Web development it is reasonable to expect the average string length from the user to be 300 chars, despite wanting to be prepared for a maximum of 32K or even larger. Oops. Typing that prompted me to do the query (see below). Averaging 2 million messages on photo.net, the correct number is 425. The histogram query reveals that out of 2 million messages over a 10-year period a 32K limit would have resulted in 6 messages being rejected and a 16K limit something like 30 rejections.


If you wanted to implement something like Salon.com as a single RDBMS table for both articles and comments on articles I think a 64K limit might be required. If someone authors a 5000-word magazine article in Microsoft Word and then saves as HTML that will be 25-30k of content plus at least a factor of 2 in HTML tags and other Microsoft-added filler. http://www.photo.net/bboard/q-and-a-fetch-msg?msg_id=002oFh is an example of one of the big postings on photo.net. It is 37962 chars long, the HTML is very clean (i.e., much less filler than if saved by Word), and yet the page doesn’t seem excessively long.


So… my conclusion from looking at the queries below is that 32K would do the job for a pure discussion forum system and that it would be marginal for storing articles unless a publisher decided that everything should be broken up into “part 1”, “part 2”, and “part 3”. Looking at the .html files on photo.net, the vast majority are in fact under 32K and most are under 64K.


However, if you look at http://philip.greenspun.com/seia/ the largest chapters are 88.7k.


If you want to facilitate novice programmers building full-scale content management systems where all the content is uploaded from browsers it might be necessary to make the varchar datatype bloat up to 100k-ish. But 32K would be adequate to build something like eBay, amazon (user-uploaded content and much of the publisher content as well), or photo.net discussion forums.


———– some stats from photo.net


select avg(dbms_lob.getlength(MESSAGE)),count(*) from bboard;


AVG(DBMS_LOB.GETLENGTH(MESSAGE))   COUNT(*)
——————————– ———-
        424.672669    2052290


select round(dbms_lob.getlength(MESSAGE),-3), count(*)
from bboard
group by round(dbms_lob.getlength(MESSAGE),-3);



ROUND(DBMS_LOB.GETLENGTH(MESSAGE),-3) COUNT(*)
————————————- ———-
        0  1452035
     1000   510236
     2000    58330
     3000    11972
     4000     3399
     5000     1303
     6000      481
     7000      264
     8000      138
     9000       93
    10000       66

    11000       38
    12000       22
    13000       21
    14000       17
    15000        6
    16000       12
    17000        9
    18000       14
    19000        4
    20000        4
    21000        1

    22000        2
    23000        1
    24000        4
    25000        3
    26000        2
    27000        1
    30000        2
    32000        1
    33000        2
    34000        1
    37000        1

    38000        1


Note:  Oracle did a much better job formatting these in SQL*Plus; for some reason the tabs didn’t carry through after cutting and pasting.


——————————


Epilogue (not from my email to the friend):


Look how much fun it is to program SQL.  Three lines of code and you get an interesting answer (and those three lines would have been much cleaner and simpler if we hadn’t been forced to use the CLOB datatype, which has its own strange accessor functions).  Compare to Java and C where typing until your fingers fall off usually doesn’t result in much of anything.  SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.

25 thoughts on “How long is the average Internet discussion forum posting?

  1. Sure wish a historical company named Software House here in Cambridge had jumped on the SQL Wagon years ago with their System 1022! More to the point, I wish that ‘I’ had learned the language years ago … I’d probably be a lot more capable of securing employment today. The beauty of SQL is that MySQL is freely available and powerful, indeed. Thanks for the ‘fun’, Phil.

  2. SQL is a great report-generating macro-language, but using it effectively requires understanding the database structures, indexes and keywords. I guess it helps if you designed and have access to the database at PN 😉

  3. CLOBs for long strings are Sooo last century, an Oracle relic. These days if you implement a website you want to run effortlesly and without the need for a team of full-time DBAs, you stay away from Oracle to begin with. Well, that’s my opinion anyway.

    In the commercial arena, that pretty much leaves Sybase and MS SQL Server. The former supports 16K and if you need more than that and are using Java, the Sybase JDBC driver takes care of CLOBs seamlesly, just treat it as a varchar in your dml or queries. MS-SQL is a bit behind with 8K, but still enough for most message boards. (PS: if you do use that and Java, stay away from the MS JDBC driver, it is too feeble to use indexes when you use a PreparedStatement. The drivers from http://www.inetsoftware.de/ are much superior)

    But if you are a smart web developer, you use Postgres, which has a 2GB varchar limit.

    Dumb developers use MySQL, however. That also has a 2GB TEXT type. Considering most of those users will be too dumb to be able to use LOB features like in Oracle, but many message boards are being written using it an PHP, I’ll assume that even that combo allows you to treat these in the same way as varchars.

    So as long as you use good (and free in cases) tools, ie: Postgres and AOLserver or Java, you will have no problems with CLOB usage for these long fields.

    But the rest of your article/analysis makes for an interesting read. Another reason for capping the length of these text field, either by DB constraints or in your code, is DOS attacks that might want to fill up your DB…

  4. It’s true about SQL – it is a great language. You do spend more time thinking than coding.
    In that respect Java sometimes drives me nuts – many’s the time I actually can’t type fast enough to get the Java code out of my head onto the page. Now I know Mozart had that problem with musical notation and since I’m definitely no genius, so the signal-to-noise ratio of Java must be horrible!

    One really really should learn lisp you know. My excuse is that I’m waiting for Paul Graham’s Arc…

  5. Richard, I share your pain about Java. The only thing I can say is that a good IDE helps. For the resource hog it is, IDEA’s code completion is so good it speeds you up incredibly. Wouldn’t want to use anything else anymore!

  6. Which PHP-based (or other suspicious-language) message boards are you thinking of that actually demonstrate problems?

  7. I wouldn’t know, I don’t keep track. In any case, that isn’t a good question, the WELL KNOWN ones are well known because they work properly! The worst offenders are the one-off websites. Sometimes made by newbies, sometimes by people that should have realized they were in the wrong profession a long time ago.

    To see how bad a language PHP is, simply look at the manual; pages and pages of different ways to do the same thing. And the one-shot architecture combined with a forking webserver makes for many problems under high load. As for MySQL, turn to the page in the manual where they explain what their idea of an “online-backup” is. And that’s just for starters.

  8. why not just store the location of the post in the DB?

    You could also take advantage of the file system for searches too.

  9. Dodicula: After you’ve improved the system by storing each posting as a file in the Unix or Windows file system you can further improve the system by storing information about the postings in a vertical-bar-separated flat file. Then if performance is too slow you can add an index file. To make sure that simultaneous updates don’t collide, you just add a lock file each time that your Web scripts are manipulating the information in the file system. For consistent backups, you shut the system down for a couple of hours and write everything out to DVD-ROM. When you’re done you’ll have saved yourself all the trouble of downloading and installing Postgres or Oracle!

  10. I do like the way some blogs like Moveable Type work by storing posts in a DB and writing out to the file system when changes occur. You can still use the DB for searches, data manipulation and backup.

  11. PHP is a “thrown-together unnecessary new language”? Whats wrong with PHP? Are you dismissing it because you don’t know anything about it, or do you have some specific gripes.

    “To see how bad a language PHP is, simply look at the manual; pages and pages of different ways to do the same thing.”

    So you can do something in more than one way with PHP. Hmmm sounds a lot like C/C++. I suppose C/C++ are bad langauges too?

  12. The PHP+MySQL bashing is really silly IMV. First off, PHP is what it is, it’s evolved into something smelly, but it works and has flourished under shared hosting and even IBM now has issued some press ink on how they’re going to ditch java and start looking at PHP. It’s far from a perfect language, but it gets the job done. I’ve coded Java and for most web tasks, I’d say Java is an inappropriate solution and riddled with problems galore. PHP5 corrects the haphazard and halfhearted OOP implementation.

    As far as “dumb developers” use MySQL. Well, MySQL is used because it works. Maybe it’s not suitable for running an online bank but it fits the bill for what it’s being used for. And it handles large volumes of data just fine. I’m not impressed with Oracle at all, having seen it melt down more than a few times, and the bleeding (and look at the pathological liar Larry Ellison’s bank account!) just went on and on. I’ve worked in the real world where databases handle millions of transactions, and I will tell you that none of those DB2 and IMS systems (and even older, now esoteric hardware/software configurations which I’m not going to confess to right now) used any fancy SQL, as performance would have taken a hit, so on those enterprise applications, you don’t see a lot of “magic SQL” or subquerys or stuff that MySQL is criticised for. And your wallet is going to be a lot fatter, not paying the exhorbitant Oracle license and support fees.

    Additionally, Philip, your notion of using the *nix/Win file system is total folly where you’d have to give nobody/webuser id write access to entire directory trees. Even using mysql as a “flatfile” repository is preferable to that ill advised adoption. Philip, for a pretty smart guy, that’s an incredible oversight. Database access, given that the developer takes pains to properly filter data provided by posters and has alotted provisions to deal with other sorts of mischief (flood attacks, spambots, etc…) is going to be a lot more angst free.

    I’ve been building web apps for clients for 4 or 5 years now, and have not had any major problems with MySQL other than need to optimize/repair tables on an infrequent basis. And if the volume of traffic gets heavy, page caching and other techniques will still have your sites available.

    Sorry, but I’ve not seen a JSP/Tomcat web platform that I thought rivals anything done with PHP or Perl or Python. Even the ones that work good still suffer from variable response times, java runtime lockups, and goofy UI considerations like not being able to hit browser “back” button. So what should a web app be coded in? Tcl?, how is that any different than Perl or PHP? LISP? I could see components written in LISP but the bulk of the system. Perl? Is Perl not a “thrown-together” language also, that brings a whole host of issues when running mod-perl. Remember, in the non-academia world, most sites (including commercial sites) are hosted on virtual servers. Even the corporate world sports a lot of Apache/*nix, to go along with .net/asp and Websphere/(other Java platforms), using Perl and/or PHP. C/C++? You’re joking right? Unless you already have an existing backend you want to plug into. Python or Ruby? Now, there you have an argument, though they still present some problems, given the minority rung they occupy in the current scheme of things, thinking of code you can cannibalize, frameworks that can be adopted, etc..…

    I think your vision of PHP + MySQL is so 1999ish…

  13. Paul, yes, C/C++ are very bad languages for anything other that system plumbing and high performance servers. (but not applications written to run in those servers, obvioulsy!)

    Nuam, is it? I have looked at PHP5 and while things have improved, the only thing it still has going for it easy of deployment. But that could have been achieved without creating a whole new language that is so feature bloated that the very beginners it attracts can’t work out what the best way of doing something is. Never mind hacking other people’s code! Of course it gets the job done, but there are far more elegant solutions that are more fun to maintain.

    Fancy SQL isn’t for the enterprise having to do daily interest calculations on 20 million accounts. It’s for small and medium operations. Considering that 99.99999% of “LAMP” websites run on a single server, it makes sense to do as much data filtering and joining in SQL. With correct indexes, this will be as fast or better than any custom code, especialy when that code pulls more data out of the DB than it needs to. All that without the chore of writing and debugging that code! Only when you need a farm of multiple web servers does it start to make sense to take the load off the database server and distribute it, as replicating webservers is a lot easier than clustering databases.

    and none of that solves me being unable to do an friggin’ online backup for my “24/7/365” busines! I can see the attraction of PHP for many finished products as it’s the only easily distributable thing. However, I don’t understand why anyone would choose MySQL over Postgres, especialy Open Sores whores; MySQL is far from free, both the speech and beer kinds.

    PS: regarding Philips’s last post: http://dictionary.reference.com/search?q=sarcasm 🙂

  14. Programming environment elitism is so much fun. It always seems to be those in the losing, unpopular camp that have to claim to be smartest. Unfortunately for Bas, MySQL has won the war and any developer that wants a career can skip the Postgres.

    PHP may be ugly at times but it works well and predictably, supports everything on earth, and it’s installed on millions and millions of servers.

  15. Well, yes, MySQL has won (so far), just like MS-DOS won the OS wars in the 80s. But that’s because, unfortunately, in our world worse is better, not because MySQL is anything but a half-assed sorry excuse for a CSV handler with SQL syntax.

    (There are many other examples of “worse is better” in the technology world. Just ask Phil about his opinions on LISP vs. C…).

  16. mysql> select avg(length(message)), count(*) from posts;
    avg(length(message)) count(*)
    128.9908 2005468
    1 row in set (8.89 sec)

  17. No, you have that all wrong, tlack. Looking at monster.co.uk, if you want a 25K/year carreer, you pimp yourself as “PHP and MySQL developer”. I prefer the rather more lucrative carreer that requires me to know how to code Java, C, various scripting languages and quickly pick up anything else they throw at me. All the while using all 3 major commercial RDBMSs and various other overpriced, underperforming and unreliable pieces of middleware crap “enterprises” just love to waste their money on. Not to mention all the Unix and internet plumbing most of those 25K PHP developers are oblivious to.

    Postgres is just for fun at home or dot-coms.

  18. PS: Philip, thanks for getting me started in that carreer with photo.net/wtr! 🙂

  19. Again, more snobbery.

    I’ve also had a “lucrative career” requiring me to know how to code “COBOL, BAL, Java, C, REXX, CLIST, Perl, $RANDOMSCRIPTINGORPROPRIETARY LANG, to go along with the “overpriced, underperforming middleware” and “UNIX and internet plumbing” as you eloquently express. However, building sites with F/OSS like MySQL and PHP has been lucrative for me too, both in terms of getting to do more enjoyable work and financially, though not in the sense of occupying a “25K annual salary” slot as you allude to.

    And those Java programmer slots are soon to become 25K year jobs too. In they eyes of the CIO, in much of the corporate world, all coding is low level grunt “monkey work”, to be farmed out to offshore vendor bodyshops in Asian locales, or staffed by imported non-immigrant visa workers. Unless it’s a special niche, like whatever trendy flavor of “silver bullet” is in vogue, like SAP, or perhaps some esoteric middleware that forms the crux of critical company appllications, but a dearth of knowledgable individuals. A better strategy would be to garner that shiny “Project Managment Institute” diploma certificate or landing a post managing an offshore team or cleverly jockeying into an architect position where it really doesn’t matter (again, in the eyes of the CIO leadership ranks), if you ever coded a lick.

  20. Snobbery provokes snobbery, I guess…

    My point to tlack was, Naum, that instead if opting for one technology, I am a jack of all trades, like you. And it’s that ability that enables us to make good money for a long time to come whereas those that so snobbishly say: “this is the best, everyone uses it, I can’t be bothered with anything else and you and your silly postgres are out of a job” usualy do only for a while. If somebody wants to pay me this much to do PHP+MySQL, I’ll take it, I don’t care. But that doesn’t make PHP a better language or MySQL a better database.

    PS: just read tlack’s comment again. About me being on the “losing, unpopular camp”. You know what’s funny about the MySQL vs. Postgres debate? It’s that nobody ever seems to come up with a good reason why MySQL is better than Postgres, other than “more people use it, there are more tools and more books”. But never a solid, technical reason. When you point out a flaw, the response is always: “oh, you don’t really need that anyway”, “it will have that in the next version” and “it’s not _that_ bad!”. Funny.

  21. Quoting from the PHP.net function reference:
    http://us2.php.net/manual/en/ref.oci8.php
    LXXXVI. Oracle 8 functions

    These functions allow you to access Oracle9, Oracle8 and Oracle7 databases. It uses the Oracle Call Interface (OCI).

    This extension … supports binding of global and local PHP variables to Oracle placeholders, has full LOB, FILE and ROWID support and allows you to use user-supplied define variables.

    Also, Phil – I’d be curious to see your opinion of how Postnuke/PHP-Nuke etc stack up against the ACS

  22. “Compare to Java and C where typing until your fingers fall off usually doesn’t result in much of anything. SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.”

    Python.

Comments are closed.