I remember when my Web site was new, back in the winter of '93-94. I'd just put up Travels with Samantha and every day or two I'd load the whole HTTPD access log into Emacs and lovingly read through the latest lines, inferring from the host name which of my friends it was, tracing each user's path through the book, seeing where they gave up for the day.
Now the only time I look at my server log is when my Web server is melting down. When you are getting 25 hits/second for 20 or 30 simultaneous users, it is pretty difficult to do "Emacs thread analysis".
There must be a happy middle ground.
You can also refine content. By poring over my logs, I discovered that half of my visitors were just looking at the slide show for Travels with Samantha. Did that mean they thought my writing sucked? Well, maybe, but it actually looked like my talents as a hypertext designer were lame. The slide show was the very first link on the page. Users had to scroll way down past a bunch of photos to get to the "Chapter 1" link. I reshuffled the links a bit and traffic on the slide show fell to 10%.
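That kind of finding can be checked mechanically rather than by eyeballing the log. Here is a minimal sketch in Python, assuming the server writes NCSA Common Log Format and treating each distinct host as one visitor (a crude approximation). The function name visitor_share, the sample hosts, and the paths are made up for illustration and are not taken from the real site.

```python
import re
from collections import defaultdict

# Common Log Format: host ident authuser [date] "METHOD url PROTO" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:\S+) (\S+)[^"]*" \d{3} \S+')

def visitor_share(lines, path):
    """Return the fraction of distinct hosts that requested `path` at least once."""
    pages_by_host = defaultdict(set)
    for line in lines:
        m = CLF.match(line)
        if m:  # silently skip lines that are not in the expected format
            host, url = m.groups()
            pages_by_host[host].add(url)
    if not pages_by_host:
        return 0.0
    hits = sum(1 for pages in pages_by_host.values() if path in pages)
    return hits / len(pages_by_host)

# Hypothetical sample data: four distinct hosts, two of which load the slide show.
sample = [
    'a.example.com - - [09/Jan/1997:10:00:00 -0500] "GET /samantha/slide-show.html HTTP/1.0" 200 1234',
    'a.example.com - - [09/Jan/1997:10:01:00 -0500] "GET /samantha/chapter-1.html HTTP/1.0" 200 5678',
    'b.example.org - - [09/Jan/1997:10:02:00 -0500] "GET /samantha/slide-show.html HTTP/1.0" 200 1234',
    'c.example.net - - [09/Jan/1997:10:03:00 -0500] "GET /samantha/chapter-1.html HTTP/1.0" 200 5678',
    'd.example.edu - - [09/Jan/1997:10:04:00 -0500] "GET /samantha/ HTTP/1.0" 200 900',
]

share = visitor_share(sample, "/samantha/slide-show.html")
print(f"{share:.0%} of hosts requested the slide show")
```

Running the same function before and after a change to the page layout gives a rough before/after comparison. Counting hosts rather than raw hits keeps one enthusiastic reader from skewing the number.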
Finally, once your site gets sufficiently popular, you will probably turn off hostname lookup (I did it when I crossed the 200,000 hits/day mark). Unix named is slow and sometimes causes odd server hangs. Anyway, after you turn lookup off, your log will be filled up with just plain IP addresses. You'll probably want to run a log analyzer on a separate machine to at least figure out whether your users are foreign, domestic, internal, or what.
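The offline classification step might look something like the following Python sketch. It is only an illustration under stated assumptions: the set of "domestic" top-level domains is a guess at a US-centric definition, example.com stands in for your own domain, and the function names resolve and classify are hypothetical. The real lookup uses the standard library's socket.gethostbyaddr; the demonstration swaps in a fake lookup table so that it runs without touching the network.

```python
import socket
from collections import Counter

# Assumption: these top-level domains count as "domestic" (US-centric guess).
DOMESTIC_TLDS = {"com", "edu", "gov", "mil", "net", "org", "us"}

def resolve(ip):
    """Reverse-resolve an IP address; return None if no name is found."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return None

def classify(ip, internal_domain, lookup=resolve):
    """Label an address as internal, domestic, foreign, or unresolved."""
    name = lookup(ip)
    if name is None:
        return "unresolved"
    name = name.lower().rstrip(".")
    if name == internal_domain or name.endswith("." + internal_domain):
        return "internal"
    tld = name.rsplit(".", 1)[-1]
    return "domestic" if tld in DOMESTIC_TLDS else "foreign"

# Fake reverse-DNS table so the demo needs no network (addresses are
# documentation-only and the host names are invented).
fake_dns = {
    "192.0.2.1": "www.example.com",
    "192.0.2.2": "dialup.isp.net",
    "192.0.2.3": "host.uni.edu.au",
}

ips = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.9", "192.0.2.2"]
tally = Counter(classify(ip, "example.com", lookup=fake_dns.get) for ip in ips)
print(dict(tally))
```

Because real reverse lookups are slow, a production version would probably cache results per address, which is one more reason to run this away from the Web server.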
For me, log analyzers break down along two dimensions:
A substrate-based log analyzer makes use of a well-known and proven system to do storage management and sometimes more. Examples of popular substrates for log analyzers are perl and relational databases. A standalone log analyzer is one that tries to do everything by itself. Usually these programs are written in primitive programming languages like C and do storage management in an ad hoc manner. This leads to complex source code that you might not want to tangle with and, ultimately, to core dumps on logs of moderate size.
Here's my experience with a few programs...
There are a lot of newer tools than wwwstat in the big Yahoo list. A promising candidate is this Python program from Australia, but I haven't personally tried it. A lot of my readers seem to like analog (referenced from Yahoo), but again I haven't tried it.
I've been wanting to try the new release myself, but OpenMarket is so far ahead of the pack in making Internet commerce a reality that they leave old-timers like me in the dust. The obstacles I faced were the following:
Of course, I went through a similar experience getting the first version of WebReporter downloaded and installed. So for the moment I've given up on the product. My main reservation about WebReporter in particular is that it seems to require 50 or 60 phone calls/year plus money to keep it current. I've managed to serve about 700 million hits with server programs from Netscape and NaviSoft (now AOL) without really ever having to call either company.
My experience with WebReporter has made me wary of standalone commercial products in general. Cumulative log data may actually be important to you. Why do you want to store it in a proprietary format accessible only to a C program for which you do not have source code? What guarantee do you have that the people who made the program will keep it up to date? Or even stay in business?
So brilliant and original was my thinking on this subject that the net.Genesis guys apparently had the idea a long time ago. They make a product called net.Analysis that purports to stuff server logs into a (bundled) Informix RDBMS in real time.
Probably this is the best of all possible worlds. You do not surrender control of your data. With a database-backed system, the data model is exposed. If you want to do a custom query or the monolithic program dumps core, you don't have to wait four months for the new version. Just go into SQL*Plus and type your query. Any of these little Web companies might fold and/or decide that they've got better things to do than write log analyzers, but Oracle, Informix, and Sybase will be around. Furthermore, SQL is standard enough that you can always dump your data out of Brand O into Brand I or vice versa.
Caveats? Maintaining a relational database is not such a thrill, though the fact that your Web site need not depend on the RDBMS being up (you can log to a file and then do a batch insert) makes it easier. I haven't tried the net.Genesis product myself, but I have built custom sites that use the NaviServer API to write log records into Oracle and Illustra RDBMS installations. You may want to benchmark your database before going wild with the real-time inserts.
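The log-to-a-file-then-batch-insert approach can be sketched in a few lines of Python. This is an illustration only, not how net.Analysis or the NaviServer sites work: SQLite (via the standard sqlite3 module) stands in for Oracle or Informix, the hits table and its columns are an invented data model, and the log is assumed to be in Common Log Format. The last statement shows the kind of ad hoc query you could type yourself once the data model is exposed.

```python
import re
import sqlite3

# Common Log Format: host ident authuser [date] "METHOD url PROTO" status bytes
CLF = re.compile(r'^(\S+) \S+ \S+ \[([^\]]*)\] "(\S+) (\S+)[^"]*" (\d{3}) (\d+|-)')

def parse(line):
    """Turn one log line into a row tuple, or None if it does not match."""
    m = CLF.match(line)
    if m is None:
        return None
    host, when, method, url, status, size = m.groups()
    return host, when, method, url, int(status), 0 if size == "-" else int(size)

def load(conn, lines):
    """Batch-insert all parseable lines in one transaction; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits ("
        "host TEXT, logged_at TEXT, method TEXT, url TEXT, "
        "status INTEGER, bytes INTEGER)"
    )
    rows = [row for row in map(parse, lines) if row is not None]
    with conn:  # commits on success, rolls back if the insert fails
        conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?, ?)", rows)
    return len(rows)

# Hypothetical log contents; the third line is deliberately malformed.
sample = [
    '192.0.2.1 - - [09/Jan/1997:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 4096',
    '192.0.2.2 - - [09/Jan/1997:10:00:05 -0500] "GET /photo.html HTTP/1.0" 200 20480',
    'this line is garbage and gets skipped',
    '192.0.2.1 - - [09/Jan/1997:10:00:09 -0500] "GET /index.html HTTP/1.0" 304 -',
]

conn = sqlite3.connect(":memory:")  # an in-memory stand-in for a real RDBMS
loaded = load(conn, sample)

# An ad hoc query: hits per URL, most popular first.
per_url = conn.execute(
    "SELECT url, COUNT(*) AS n FROM hits GROUP BY url ORDER BY n DESC, url"
).fetchall()
print(loaded, per_url)
```

Doing the insert as a single batch inside one transaction is usually far cheaper than one commit per hit, which is why benchmarking matters before switching to real-time inserts.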
One of those "new kludges" that I've had experience with is analog. It's highly configurable as to what information you want reported, and it addresses all of the issues that Phil has with wwwstat except for non-reporting of referer and browser information: you can alias a file to be reported as another file (with wildcards, even), get distinct host information as well as stats by domain, and get it to do DNS lookups on raw addresses for you. The latest version even uses pretty GIF bargraphs.
-- Jin Choi, January 9, 1997
I lied, it even deals with all those extra logs.
-- Jin Choi, January 9, 1997
analog is by far the best freely available web stats program.
-- firstname.lastname@example.org --, February 18, 1997
Like others, I really appreciate Analog as a stats program (my stats can be viewed by anyone at http://www.ca-probate.com/stats/stats.html).
But for a huge number of web site developers, it is impossible to use a program like Analog because the users have no access to their site logs! For those users, the only way to get statistics would be to use a "counter" or "tracker" service. For a list of such services, see http://www.ca-probate.com/counter.htm
-- Mark J. Welch, March 7, 1997