A friend of mine who works at a database management system company asked for thoughts on how long a string a database table needs to be able to store, as a practical matter, to serve most Internet programming needs. This prompted me to do some queries into the photo.net discussion forum. Here’s my message to him, which I thought would be interesting to nerd readers….
Three basic issues for Web development relating to varchar/clob:
1) Strings uploaded from a browser-rendered TEXTAREA are of a length that is impossible for the programmer to predict. In a sense, then, every text slot in the database must be prepared to accept a string of arbitrary length.
http://www.cs.tut.fi/~jkorpela/forms/textarea.html#browserlimits reveals that some browsers have limits of 32K or 64K but that as Microsoft and Mozilla get more sophisticated these seem to be disappearing.
2) Software developers of Internet applications are often first-time SQL programmers and sometimes first-time programmers altogether. Unless a DBMS can make CLOBs work with every SQL function and command, these novice programmers must learn what is essentially a whole new computer language to deal with CLOBs.
3) Internet applications are often developed using feeble ad hoc tools, such as PHP (my students in 6.171, MIT’s Software Engineering for Internet Applications, mostly picked PHP to do their semester project even though I would have discouraged this, being a mistruster of thrown-together unnecessary new languages). Many of these tools don’t have facilities for dealing with anything beyond the basic SQL data types so they couldn’t use CLOBs if they wanted to.
I think for Web development it is reasonable to expect the average string length from the user to be 300 chars, despite wanting to be prepared for a maximum of 32K or even larger. Oops. Typing that prompted me to do the query (see below). Averaged over 2 million messages on photo.net, the correct number is 425. The histogram query reveals that out of 2 million messages over a 10-year period a 32K limit would have resulted in no more than 6 messages being rejected and a 16K limit in something like 50 to 65 rejections.
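The rejection counts can be bounded from the rounded histogram in the stats at the end of this message. Here is a minimal sketch, assuming that "16K" and "32K" mean 16384 and 32768 characters and that Oracle's ROUND(n, -3) puts lengths 15500 through 16499 into the 16000 bucket (and so on). The function name `rejection_bounds` is made up for illustration. Because the buckets are 1000 characters wide, this gives a range, not an exact count.

```python
# Bucket -> message count, copied from the rounded histogram in the stats below.
HISTOGRAM = {
    0: 1452035, 1000: 510236, 2000: 58330, 3000: 11972, 4000: 3399,
    5000: 1303, 6000: 481, 7000: 264, 8000: 138, 9000: 93,
    10000: 66, 11000: 38, 12000: 22, 13000: 21, 14000: 17,
    15000: 6, 16000: 12, 17000: 9, 18000: 14, 19000: 4,
    20000: 4, 21000: 1, 22000: 2, 23000: 1, 24000: 4,
    25000: 3, 26000: 2, 27000: 1, 30000: 2, 32000: 1,
    33000: 2, 34000: 1, 37000: 1, 38000: 1,
}


def rejection_bounds(limit, histogram=HISTOGRAM):
    """Return (fewest, most) messages that could be longer than `limit`."""
    fewest = most = 0
    for bucket, count in histogram.items():
        # Assumed range of lengths that round to this bucket.
        shortest, longest = max(bucket - 500, 0), bucket + 499
        if shortest > limit:
            # Every message in this bucket is over the limit.
            fewest += count
            most += count
        elif longest > limit:
            # The bucket straddles the limit, so only some may be over.
            most += count
    return fewest, most


for limit in (16 * 1024, 32 * 1024):
    print(limit, rejection_bounds(limit))
```

With the bucket counts above this gives bounds of 53 to 65 messages over a 16K limit and 3 to 5 over a 32K limit; exact figures would need a direct count against the table.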
If you wanted to implement something like Salon.com as a single RDBMS table for both articles and comments on articles I think a 64K limit might be required. If someone authors a 5000-word magazine article in Microsoft Word and then saves as HTML that will be 25-30k of content plus at least a factor of 2 in HTML tags and other Microsoft-added filler.
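The arithmetic behind the 64K figure, as a small sketch. The inputs are just the estimates from the paragraph above (25-30k characters of text, with "k" read as 1000, and a factor of 2 for Word's HTML), and since the factor is "at least" 2 the results are lower bounds. The function name is hypothetical.

```python
# Estimates taken from the paragraph above; not measurements.
CONTENT_CHARS = (25_000, 30_000)  # text of a 5000-word article
HTML_OVERHEAD_FACTOR = 2          # "at least a factor of 2", so a lower bound


def html_size_range(content_chars=CONTENT_CHARS, factor=HTML_OVERHEAD_FACTOR):
    """Return the (low, high) estimate of the saved-as-HTML size in characters."""
    low, high = content_chars
    return low * factor, high * factor


low, high = html_size_range()
for name, limit in (("32K", 32 * 1024), ("64K", 64 * 1024)):
    # An article fits only if even its high estimate is under the limit.
    print(name, "smallest fits:", low <= limit, "largest fits:", high <= limit)
```

Under these estimates the article comes to 50,000 to 60,000 characters, which is over a 32K limit in every case and under a 64K limit, though without much room to spare if the overhead factor is higher than 2.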
http://www.photo.net/bboard/q-and-a-fetch-msg?msg_id=002oFh is an example of one of the big postings on photo.net. It is 37962 chars long, the HTML is very clean (i.e., much less filler than if saved by Word), and yet the page doesn’t seem excessively long.
So… my conclusion from looking at the queries below is that 32K would do the job for a pure discussion forum system and that it would be marginal for storing articles unless a publisher decided that everything should be broken up into “part 1”, “part 2”, and “part 3”. Looking at the .html files on photo.net, the vast majority are in fact under 32K and most are under 64K.
However, if you look at http://philip.greenspun.com/seia/, the largest chapters are 88.7k.
If you want to facilitate novice programmers building full-scale content management systems where all the content is uploaded from browsers it might be necessary to make the varchar datatype bloat up to 100k-ish. But 32K would be adequate to build something like eBay, Amazon (user-uploaded content and much of the publisher content as well), or photo.net discussion forums.
———– some stats from photo.net
select avg(dbms_lob.getlength(MESSAGE)),count(*) from bboard;
AVG(DBMS_LOB.GETLENGTH(MESSAGE)) COUNT(*)
——————————– ———-
424.672669 2052290
select round(dbms_lob.getlength(MESSAGE),-3), count(*)
from bboard
group by round(dbms_lob.getlength(MESSAGE),-3);
ROUND(DBMS_LOB.GETLENGTH(MESSAGE),-3) COUNT(*)
————————————- ———-
0 1452035
1000 510236
2000 58330
3000 11972
4000 3399
5000 1303
6000 481
7000 264
8000 138
9000 93
10000 66
11000 38
12000 22
13000 21
14000 17
15000 6
16000 12
17000 9
18000 14
19000 4
20000 4
21000 1
22000 2
23000 1
24000 4
25000 3
26000 2
27000 1
30000 2
32000 1
33000 2
34000 1
37000 1
38000 1
Note: Oracle did a much better job formatting these in SQL*Plus; for some reason the tabs didn’t carry through after cutting and pasting.
——————————
Epilogue (not from my email to the friend):
Look how much fun it is to program SQL. Three lines of code and you get an interesting answer (and those three lines would have been much cleaner and simpler if we hadn’t been forced to use the CLOB datatype, which has its own strange accessor functions). Compare to Java and C where typing until your fingers fall off usually doesn’t result in much of anything. SQL, Lisp, and Haskell are the only programming languages that I’ve seen where one spends more time thinking than typing.
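To make the "cleaner and simpler" point concrete, here is a rough sketch of the same two queries against a plain text column, where the ordinary `length()` function stands in for `dbms_lob.getlength()`. Everything here is an assumption for illustration: it uses SQLite through Python's standard library instead of Oracle, the `bboard` table holds five made-up messages, and because SQLite's `round()` does not round to the left of the decimal point the bucketing is done with integer arithmetic.

```python
import sqlite3

# Toy stand-in for the real table: one text column, five made-up messages.
conn = sqlite3.connect(":memory:")
conn.execute("create table bboard (message text)")
lengths = [10, 499, 500, 1499, 1500]
conn.executemany("insert into bboard values (?)", [("x" * n,) for n in lengths])

# Average length and row count, using plain length() on the column.
avg_len, total = conn.execute(
    "select avg(length(message)), count(*) from bboard"
).fetchone()

# Histogram by length rounded to the nearest 1000 (integer arithmetic).
histogram = conn.execute(
    """
    select (length(message) + 500) / 1000 * 1000 as bucket, count(*)
    from bboard
    group by bucket
    order by bucket
    """
).fetchall()

print(avg_len, total)
print(histogram)
```

The shape of the queries is unchanged; only the length accessor differs.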