Using CVS for Web development

by Philip Greenspun for Web Tools Review

If you have a very clear publishing objective, specs that never change, and one very smart developer, you don't need version control. If you have evolving objectives, changing specifications, and multiple contributors, you need version control.

The Solution

three Web servers (can be on one physical computer)
two Oracle users/tablespaces (can be in one Oracle instance)
one Concurrent Versions System (CVS) root
two people trained to understand CVS

Let's go through these item by item.

Item 1: Three Web Servers

Suppose that your overall objective is to serve a Web service accessible at "foobar.com". You need a production server, rooted at /web/foobar (Server 1). You don't want your programmers making changes on the live production site. That's sort of the whole point of this document. So you need a development server, rooted at /web/foobar-dev/ (Server 2). You might think that this is enough. When everyone is happy with the dev server, have a code freeze, test a bit, then copy the dev code over to the production directory and restart.

What's wrong with the two-server plan? Nothing if you are running photo.net circa 1997. The development team consisted of me and Jin. The testing team... me and Jin! Note that there was no possibility of simultaneous development and testing. ArsDigita.com customers, however, usually have enough budget to pay for four or five programmers plus 20 or 30 internal staffers who may be updating content, testing changes, and sometimes contributing code. For a complex site, the publisher may wish to spend a week testing before launching a revision. It isn't acceptable to idle authors and developers while a handful of testers bangs away at the development server. The solution? A staging server, rooted at /web/foobar-staging/ (Server 3).

Here's how the three are used:

developers work continuously in /web/foobar-dev/
when the publisher is mostly happy with the development site, a named version is created and installed at /web/foobar-staging
the testers bang away at the /web/foobar-staging server
when the testers and publishers sign off on the staging server's performance, the site is released to /web/foobar/ (production)
any fixes made to the staging server are merged back into the development server

Item 2: two Oracle users/tablespaces

Suppose that you have a working production site. You could connect your /web/foobar-dev/ to the production Oracle user. After all, Oracle's raison d'être is concurrency control. It will be happy to run eight simultaneous connections to your production site plus two or three to the development server. The fly in this ointment is that one of your developers might get a little sloppy and write a program that sends "drop table users" rather than "drop table users_experimental_extra_table" to the database.

So it would seem that we'll need at least one new Oracle playground. Here are the steps:

create a new Oracle user and tablespace, named "foobardev" (assuming the production user is "foobar")
import a recent Oracle export.dmp file to populate your tablespace with what was on the production site (if you're following the tenets of the ArsDigita Server Architecture you'll always have one from the previous night anyway). Cry with pain as you discover that Oracle imports don't work with LOB columns unless you're importing into an installation that has a tablespace with the same name as the one from which the tables were exported.
every time you alter a table, add a table, or populate a new table, record the operation in /web/foobar-dev/www/doc/sql/patches.sql
when you're ready to move from staging to production, hastily apply all the data model modifications from patches.sql to the production Oracle user
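
The third step is the one that gets skipped under deadline pressure. Here is a minimal sketch of a helper that appends each data model change to the patches file. The function name record_patch, the dated comment header, and the example column are all assumptions for illustration; the article only says that every change must be recorded.

```python
import datetime
import pathlib

# Path taken from the article; adjust for your own server layout.
PATCHES_FILE = pathlib.Path("/web/foobar-dev/www/doc/sql/patches.sql")


def record_patch(statement, patches_file=PATCHES_FILE, today=None):
    """Append one SQL statement to the patches file under a dated comment.

    The "-- YYYY-MM-DD" header is an assumed convention, not something
    the article prescribes.
    """
    today = today or datetime.date.today()
    # Normalize so that every recorded statement ends in exactly one ";".
    sql = statement.strip().rstrip(";") + ";"
    entry = f"-- {today.isoformat()}\n{sql}\n\n"
    patches_file = pathlib.Path(patches_file)
    patches_file.parent.mkdir(parents=True, exist_ok=True)
    with patches_file.open("a", encoding="utf-8") as f:
        f.write(entry)
    return entry
```

A developer who has just run an "alter table" against the foobardev user would call record_patch("alter table users add (nickname varchar(100))") right afterward, so that the file replays in order against production.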

Shouldn't we have three Oracle users? One for dev, one for staging, one for production? No. It usually isn't worth it. Adding a column to a relational database table seldom breaks queries. Until Oracle 8.1.5, you weren't able to drop a column. And anyway the radical data model changes tend to take place when a site has yet to be launched.

The bottom line is that it takes work to keep three Oracle users' objects in sync. It is half as much work to sync two and almost as useful. How to deploy these two Oracle users? Park one behind the production server. Use the other one behind the dev and staging servers.
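
To make the pairing concrete, here is a small sketch of the three servers and the Oracle user behind each. The directory names and the "foobar" and "foobardev" users come from the text above; the Environment class and the shares_database helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    server_root: str  # where the Web server's files live
    oracle_user: str  # which Oracle user the server connects as


ENVIRONMENTS = {
    "production": Environment("production", "/web/foobar", "foobar"),
    "staging": Environment("staging", "/web/foobar-staging", "foobardev"),
    "dev": Environment("dev", "/web/foobar-dev", "foobardev"),
}


def shares_database(a, b):
    """True when two environments sit in front of the same Oracle user."""
    return ENVIRONMENTS[a].oracle_user == ENVIRONMENTS[b].oracle_user
```

The practical consequence is that a table altered for the dev server is altered for the staging server too, which is exactly the trade-off accepted above.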

Item 3: one Concurrent Versions System (CVS) root

The Concurrent Versions System (CVS) is a powerful file system-based tool that can do the following things:

remember what all the previous checked-in versions of a file contained, using its repository
show you the difference between what's in your tree and what's in the repository
help you merge changes made simultaneously by multiple authors who might have been unaware of each other's work
group a snapshot of currently checked-in versions of files as "Release 2.1" or "JuneIssue"

CVS is free and open-source.

CVS does all of this via its repository or "CVS root". This is a directory, typically /usr/local/cvsroot/. Most Unix machines don't have enough space in the /usr partition to store all Web content. Remember that the CVS root will be at least as large as all of the files under source control. Thus we will use /cvsweb as our CVS root and, if need be, migrate it to a separate disk subsystem.

Create a project from your development Web sources (from /web/foobar-dev/) so that they will end up at /cvsweb/foobar/.
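
A sketch of the commands involved, built as argument lists rather than executed. The /cvsweb root and the "foobar" module come from the text; the vendor tag "arsdigita", the release tag "start", and the log message are placeholders (cvs import insists on both tags but does not care what they are called).

```python
def repository_setup_commands(cvsroot="/cvsweb", module="foobar",
                              vendor_tag="arsdigita", release_tag="start"):
    """Return the cvs commands (as argv lists) that create and fill the repository.

    The import command must be run from inside the source tree
    (/web/foobar-dev/ in the article). vendor_tag and release_tag are
    placeholder names chosen for this sketch.
    """
    return [
        # Create the repository's administrative files under the CVS root.
        ["cvs", "-d", cvsroot, "init"],
        # Copy the current directory tree into the repository as `module`.
        ["cvs", "-d", cvsroot, "import", "-m", "initial import from dev server",
         module, vendor_tag, release_tag],
    ]
```

Note that importing does not turn /web/foobar-dev/ into a working copy. After the import you would move the original tree aside and check the module out in its place, for example with "cvs -d /cvsweb checkout -d foobar-dev foobar" run from /web.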

Item 4: Two Trained CVS Users

Don't plan to teach all of your contributors the arcana of CVS. The ones who use GNU Emacs will need to learn to type C-x C-q and C-c C-c to contribute change comments. But the contributors who use primitive tools (FTP, HTTP PUT, vi) can remain blissfully unaware of the fact that CVS is in use.

Who is really using CVS then? A cron job. Every day just before midnight the cron job should check in all changes from the dev server to the main branch, with the change comment 'nightly check-in YYYY-MM-DD'. The cron job should notify the Release Master if any files that are in the repository have been deleted so that he or she can decide whether the removal was a mistake or if typing cvs remove is warranted (the files don't really go away; they go into an "attic").
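
Here is a minimal sketch of the two pieces such a cron job needs: the dated change comment, and a scan for files that CVS knows about but that have vanished from disk. The scan reads the CVS/Entries file that CVS keeps in every working directory; the function names are hypothetical, and a real job would also have to run cvs add for new files, which this sketch does not attempt.

```python
import datetime
import os


def nightly_comment(today=None):
    """Build the change comment the article asks for."""
    today = today or datetime.date.today()
    return f"nightly check-in {today.isoformat()}"


def commit_command(today=None):
    """argv for the nightly commit; run it from the top of the dev working copy."""
    return ["cvs", "-q", "commit", "-m", nightly_comment(today)]


def missing_files(root):
    """List files recorded in CVS/Entries that no longer exist on disk."""
    missing = []
    for dirpath, dirnames, _ in os.walk(root):
        if "CVS" in dirnames:
            dirnames.remove("CVS")  # don't descend into CVS admin directories
        entries = os.path.join(dirpath, "CVS", "Entries")
        if not os.path.isfile(entries):
            continue
        with open(entries, encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("/")
                # File entries look like "/name/revision/timestamp/options/tag";
                # directory entries start with "D" and are skipped here.
                if len(fields) < 3 or fields[0] != "":
                    continue
                name, revision = fields[1], fields[2]
                if revision.startswith("-"):
                    continue  # already scheduled for removal with "cvs remove"
                path = os.path.join(dirpath, name)
                if not os.path.exists(path):
                    missing.append(os.path.relpath(path, root))
    return sorted(missing)
```

The cron job would mail the result of missing_files("/web/foobar-dev") to the Release Master before running the commit, leaving the decision about cvs remove to a human.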

One person is designated the Release Master. Normally this person does nothing. When the publisher is happy with the behavior of the development server, the Release Master creates a CVS branch named "199909Launch" or whatever. The Release Master updates the staging server from CVS with this branch. Development proceeds with checkins to the main CVS branch. These won't affect updates from the 199909Launch branch.
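
A sketch of the Release Master's two commands, again as argument lists. One caveat: CVS rejects tag names that begin with a digit, so this sketch uses the hypothetical name "Launch199909" where the text says "199909Launch". The helper names are assumptions for illustration.

```python
import re

# CVS requires tag names to start with a letter and to contain only
# letters, digits, "-" and "_". HEAD and BASE are reserved.
_TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_valid_tag(name):
    """True if `name` is acceptable to CVS as a tag or branch name."""
    return bool(_TAG_RE.fullmatch(name)) and name not in ("HEAD", "BASE")


def branch_commands(branch, module="foobar", cvsroot="/cvsweb"):
    """argv lists for creating a release branch and pointing staging at it."""
    if not is_valid_tag(branch):
        raise ValueError(f"not a legal CVS tag name: {branch!r}")
    return [
        # Branch the current head of the module directly in the repository.
        ["cvs", "-d", cvsroot, "rtag", "-b", branch, module],
        # Run in /web/foobar-staging: switch the working copy to the branch,
        # creating new directories (-d) and pruning empty ones (-P).
        ["cvs", "-q", "update", "-r", branch, "-d", "-P"],
    ]
```

Because rtag works on what is in the repository rather than what is on disk, the Release Master should make sure the dev server's changes have been checked in (by the nightly job or by hand) before branching.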

Once the staging server has been thoroughly tested, the Release Master checks in any changes that have been made. The check-in happens twice, once to the 199909Launch branch (there won't be any conflicts since nobody has been touching this) and once to the main branch (conflicts may need to be resolved).
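
One common way to do the double check-in is to commit on the branch from the staging tree and then merge the branch into the dev tree with cvs update -j. That is an assumption about mechanics; the text above only says that the check-in happens twice. The branch name "Launch199909" and the function name are hypothetical.

```python
def merge_back_commands(branch, comment):
    """argv lists, grouped by the working copy each one runs in."""
    return {
        # Staging is checked out on the branch, so a plain commit lands there.
        "/web/foobar-staging": [
            ["cvs", "-q", "commit", "-m", comment],
        ],
        # Dev tracks the main branch: merge the branch in, fix any conflicts
        # by hand, then commit the result.
        "/web/foobar-dev": [
            ["cvs", "-q", "update", "-j", branch],
            ["cvs", "-q", "commit", "-m", f"merge {branch}: {comment}"],
        ],
    }
```

If the same branch is merged more than once, later merges need a starting point as well (two -j options), or CVS will try to reapply changes that were already merged.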

When the publisher decides to go live, the Release Master takes the following steps:

manually update the /parameters/foobar.ini file as necessary
update production server from the CVS branch 199909Launch
apply any data model changes (quickly!) from /doc/sql/patches.sql.

If there are significant data model changes, do this in the middle of the night and consider bringing up a "comebacklater" server for a few minutes!

If the Release Master is doing all of this hard work, why do we need to train anyone else in CVS? A Web service is 24x7 but one person can't work 24x7. So we need a Release Apprentice for each Web service who knows everything that there is to know about this system.

Exactly which directories do we control?

A programmer's intuitions about which directories to control will generally be 180 degrees off. For example, a programmer might think that it isn't worth controlling graphics files. After all, CVS can't really do much with these besides compare them byte by byte and tag them with dates. But tagging is exactly what a release needs: the graphics have to move to staging and production along with the pages that reference them.

The ArsDigita Community System generally contains the following under /web/foobar:

/www -- the main Web server root; we must control this
/tcl -- private Tcl library; we must control this
/parameters -- server personality; we'd like to control this but we can't unless we're careful to make sure that each server has a uniquely named auxconfig .ini file, e.g., foobar.ini, foobar-dev.ini, and foobar-staging.ini. Remember that, if nothing else, the server name in each section of this .ini must be different (e.g., "foobar" and "foobar-dev"). So it would be disastrous to update the production server's aux .ini file with the dev server's aux .ini file.
/bin -- email handling scripts forked by the mailer (generally qmail); no real reason to control this unless you're running dev and production on separate computers
/templates -- for sites with fancy graphics... the fancy graphics; we must control this
(most servers) misc directories containing files uploaded by users, not kept under the Web server root due to security concerns; can't control this or we risk rolling back months of user uploads!

The bottom line is that it would be nice to just say "all of /web/foobar-dev" but we can't do this unless we're careful with the auxconfigdir (/parameters) and make sure to keep user-uploaded files out of the /web/foobar/ directory.
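
Those rules are simple enough to write down. This sketch encodes which top-level directories are under control and which auxconfig file each server should read; the directory and file names come from the list above, while the constant and function names are hypothetical.

```python
# Directories the list above says we must control.
CONTROLLED = ("www", "tcl", "templates")

# The three servers from Item 1, named after their directories under /web.
SERVERS = ("foobar", "foobar-dev", "foobar-staging")


def ini_file_for(server):
    """Each server reads only the auxconfig file that carries its own name."""
    if server not in SERVERS:
        raise ValueError(f"unknown server: {server!r}")
    return f"/web/{server}/parameters/{server}.ini"


def is_controlled(relative_path):
    """True if a path below the server root falls in a controlled directory."""
    top = relative_path.strip("/").split("/", 1)[0]
    return top in CONTROLLED
```

With uniquely named files, /parameters could join the controlled set: all three .ini files would travel to every server, but a cvs update on production could never replace foobar.ini with the dev server's settings.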

Do you need a farm of big fancy servers to implement this?

How big and how many computers do you need to adopt the procedures described in this document? Three Web servers, two Oracle users, the CVS package, ... Sounds complicated. Actually you can run it all on a $2000 Linux box.

If you're worried about your developers being sloppy and editing files in /web/foobar/ when they thought they were in /web/foobar-dev/ remember that you can always use cvs update to revert the production site to the most recent approved version.
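
Before reverting, it helps to know what was touched. Running "cvs -n -q update" in /web/foobar/ prints one status letter and path per line without changing anything, and this sketch picks out the locally edited files. The function names are hypothetical, and the -C option used for the revert is assumed to be available (it is present in CVS 1.11 and later).

```python
def locally_changed(update_output):
    """Return paths that `cvs -n -q update` reports as modified (M) or in conflict (C)."""
    changed = []
    for line in update_output.splitlines():
        status, _, path = line.partition(" ")
        if status in ("M", "C") and path:
            changed.append(path)
    return changed


def revert_command(paths):
    """argv that replaces the given files with clean copies from the repository.

    Assumes a CVS new enough to support `update -C`.
    """
    return ["cvs", "-q", "update", "-C", *paths]
```

On an older CVS without -C, deleting the edited file and running a plain cvs update on it has the same effect.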

Suppose that you've ample money for server hardware, co-location fees, and sysadmin resources. You probably want to split the production machine out and only give the Release Master and Release Apprentice access to that box. Let the developers and staging/testing folks fight it out on a development server.

Why not one development area per developer?

Classically, CVS is used by C developers and each C programmer works from his or her own directory. This makes sense because there is no persistence in the C world. You compile your code, run a binary that builds data structures in RAM and when the program terminates it doesn't leave anything behind (except maybe a core file). Checking out a CVS tree and working on it isn't a big deal.

Compare this to the world of db-backed Web servers. If you want to check out a copy of the tree and play with it, you have to create an Oracle user and tablespace, import a recent Oracle export.dmp file to populate your tablespace with what was on the production site, find a free IP address or port and set up a Web server, and then keep your Oracle table definitions in sync with any alterations other developers may be making.

In the C world, developers live to satisfy themselves. More than likely, not another soul on the planet will ever run the code that they are authoring. So it is fine for them to work alone. In the Web world, developers always work with the publisher and users. Those collaborators will need to be alerted to this new server so that they can offer criticism and advice. They might need special passwords or firewall access since most publishers don't like to let the public see their unfinished development efforts.

In the C world, you've got the luxury of one or two years between product releases. All the work is done by people with at least four years of training. In the Web world, a significant new release may need to be produced in four weeks. Much of the work may be done by people with no formal training of any kind, e.g., designers and content authors editing templates or static .html pages. Given the chronic shortage of personnel in this industry, do you want to limit yourself to being able to hire only those who've been through a CVS training course? To those who are formally minded enough to read the CVS man pages? Remember that most of the contributors on your site will not be programmers.

The bottom line? It is just too much work to set up each contributor with his or her own little server.

Good Things About This System

To end this article on a positive note, let's summarize the good things about this system:

if something is screwy with the production server, you can easily revert to a known and tested version
a programmer who is a trained CVS user can protect and comment his or her changes by explicitly doing a cvs commit
a contributor who is ignorant of CVS is protected by the nightly cron job against losing more than one day of work

More

Mark D's guide at http://www.badgertronics.com/writings/cvs/index.html
Concurrent Versions System manual
Cyclic Software, a commercial software
CVS Version Control for Web Site Projects by Sean Dreilinger. Well-written but more applicable to Web sites that are static HTML files (you can download a whole tree to your local file system and play around without worrying about Oracle users).
Managing websites using Unix by Nik Clayton. Clayton basically proposes the same architecture as I do: dev, staging, production (though he does not address the Oracle issue).
CVSweb, a way to look at a CVS tree (and versions of files) from any Web browser
Open Source Development With CVS (Fogel 1999; Coriolis)
one chapter in Learning GNU Emacs explains how to use CVS conveniently from within Emacs (also explained in Mark D's guide)

philg@mit.edu

Reader's Comments

If you are setting up a new CVS server, spend a few extra minutes to configure CVS using the client-server ("pserver") mode, instead of the older file system mode. This will save you pain later and may keep you out of hot water. Pain, because moving the repository (your old one dies, your company IPOs and your boss wants to buy a big fancy server farm, you want to hide the repository behind a firewall) is a matter of changing an environment variable. You get immediate access control (developers can be protected from updating the production environment). CVS in file system mode can "hang" because it leaves a lock file around for each file and directory. Then you need a CVS guru to dive in and fix it. One note: you can't live in a mixed environment. It is either one mode or the other.
An expert tip on using client server: CVS uses gzip for compressing data across the network. The default setting is -z3 which is a pitiful waste of time. Recompile CVS to use -z9 by default (the network is the bottleneck, not CPU resources), or add it to everyone's .cvsrc configuration file (it lives in the users' home directory).
I've had some extremely painful experiences with CVS and large binary files. (Large is 32 MB or more.) When CVS checks a file out of the repository, even if it is doing nothing more than a straight copy (no diff'ing, merging, etc.) the program brings the whole file into contiguous memory. This bloats the CVS process resident set size to at least the size of the file, plus 6 MB for the program, give or take. The process is inefficient, so subsequent large files don't reuse the space well. CVS bloats even more. Make sure that your server is configured with a lot of swap space (it should have a lot of memory anyway). Even so, performance will drag down into the ground until CVS is finished (could be 30 minutes for a large working set), then things will "mysteriously" return to normal.

-- Ken Mayer, July 23, 1999

Your proposed once-per-day automatic check-in of everything is a nice idea for a group such as your ArsDigita company with its fairly nonstandard mission statement.
In more mundane companies, however, you usually have at least one mid-level manager who will see the amount of code checked in every day as a measurement of individual employee efficiency, and wreak all sorts of havoc with this misguided "knowledge".
I'm sure some of you have experienced mid-level managers who were too dumb to even figure out how to do this, but I have never been that unlucky ;-)
Apart from this, your proposed method sounds remarkably similar to what I have been doing for various db-backed websites over the last few years. It has proven itself to me to be a great time saver and I don't even want to calculate how many near disasters with their associated all-night fix-up sessions it has saved me or my co-workers from.
The pserver is surely the only way to share CVS among a group of people without running into all sorts of non-interesting problems with nfs etc. You can also tunnel it through ssh for secure over-the-net operations.

-- Kristian Elof Sørensen, July 24, 1999

Regarding putting the stuff in /parameters - the .ini files - under CVS, and requiring different .ini files for your three servers: this is a darn good reason to use Tcl configuration files in AOLserver 3.0 instead of .ini files. Then the config file can use Tcl to determine whether it's a production, dev, or staging server (based on an environment variable, or the server home, etc.), and use the appropriate config values.

-- Rob Mayoff, February 26, 2000

Although my company does not use CVS, we have used Microsoft's Visual SourceSafe and Intersolv's PVCS Version Manager. Both were a pain to set up and to get people to use. All complaints usually go away after the first time that version control saves your day after some screwup.
As for the managers, they usually don't care. Obviously some misguided soul is going to use this tool to gather information on who worked on what and for how long, but around the office 99% of the people are interested in it because it saves us from many headaches.
I don't think I ever want to work on a project without some kind of version control.

-- Pedro Vera-Perez, March 14, 2000

When dealing with large teams of developers, using CVS can be a real headache. One alternative would be BitKeeper, which solves most (if not all) of CVS's problems. It was written by the guys who did Sun's TeamWare source management system.

-- Petru Paler, April 18, 2000

I know the above is 7 years old, and version control has been gaining acceptance all the time, so what I want to add may already be obvious to most readers.
First, everything I do that is worth saving is under version control, either in CVS, or (preferably) in Subversion, which is short for "CVS with the glaring problems fixed". A version control server is best regarded as part of the regular IT infrastructure, like a file server, a webserver or a mail server. The university department where I work (Math & CS) maintains a Subversion server for all employees to use; it's very popular and works very well. Used mainly for source code, websites, and scientific papers.
Second, version control can be thought of as a tool for collaboration, but I use it more for structuring my own work. I commit my changes into version control whenever they represent some meaningful unit of change: not sooner, not later. So my commits usually correspond to specific tasks, with specific objectives that the changes were designed to achieve, and specific results - objectives met, objectives failed, new issues found. To get a task-oriented overview of the work I did on a project in the past year, I just read back the log of messages I typed with each commit. To refresh my memory on why a particular change was made, I look it up in version control and read the accompanying commit message.
It also works the other way: I can set objectives and start making the necessary changes until I have acceptable results, knowing that if half way through things turn out uglier than expected, I'll just revert to the last committed version and start over, or postpone the objective in question.
So version control really helps me structure my work. It structures my work into transactions. This benefit is lost with automated commits. The one conceptual hurdle to learning how to work with CVS or Subversion is that users must learn to think of all their edits as being parts of transactions that need to be explicitly committed or rolled back. Once they learn this it can be a big asset.

-- Reinier Post, June 17, 2006
