Web Operations
reviewed by Philip Greenspun; July 2010
Web Operations: Keeping the Data On Time, a new O'Reilly book, arrived in last week's mail. It tries to set forth all of the things that one needs to do in order to keep an Internet application available to users. I tried to do this myself back in 1998, with a 10-page Web article still available at http://philip.greenspun.com/wtr/arsdigita-server-architecture. It is interesting to compare the efforts and note the changes wrought by 12 years of feverish development.
Cloud services are something new, and a lot of the articles in the book show how they've been useful, at least at the chapter author's own site. One open question that the book does not answer is to what extent it is possible to build a complete service using only the Google App Engine, for example (reader comments would be welcome here; is it sensible to build a basic online community using Google App Engine?).
One thing that hasn't changed is that a person needs a huge variety of skills in order to keep a Web service online. Theo Schlossnagle discusses this eloquently in an opening chapter "Web Operations: The Career". It is sad to reflect that a person who had just invested four years and $200,000 in a university degree would be hopelessly ill-equipped to embark on such a career. As far as I know, there are no schools where a student would simultaneously learn about firewalls, load-balancing, RDBMS programming and administration, tracking and logging, etc.
Eric Ries contributes a good chapter promoting continuous deployment, i.e., pushing new releases of an Internet application out to customers frequently. Back in the 1990s we successfully competed against Microsoft in building an online community development toolkit. They had all of the money and programmers, but we released every 4-6 weeks and they released every year or two. So we were much quicker to learn what features users and publishers required.
Adam Jacob writes an "Infrastructure as Code" chapter that convincingly suggests one should have programs that will automatically rebuild one's services, perhaps using cloud computers such as Amazon's EC2, in the event of a disaster.
The chapter that best illustrates the three-steps-forward-two-steps-back nature of software is on the relational database. This is by Baron Schwartz, an expert on MySQL, the RDBMS of choice for many popular sites. Schwartz notes that most Web programmers don't understand how to use SQL or what the RDBMS is doing underneath. This to him is a minor flaw, perhaps comparable to a Diet Coke addiction. To me, the only significant aspects of Web development are data model, SQL queries and transactions, and page flow design. The rest is just glue code that marries HTML templates to information pulled from the database. Thus a programmer who is incompetent with SQL is an accident waiting to happen (see an earlier post regarding a Ruby on Rails wizard).
Schwartz points out that most Web services don't have a monster-sized database or a monster-sized user load. The RDBMS can thus be served by a medium-sized rack-mounted machine of some sort. I thought that he would next say "So you might as well make heavy use of best-practice RDBMS programming that has yielded so many benefits in terms of data integrity and protection against code monkey incompetence for big companies." But instead he suggests the following: use MySQL; don't use foreign key constraints; don't use triggers; don't use views; don't use stored procedures. Without a foreign key constraint, for example, Mr. Schwartz's database will happily store discussion forum postings from users who have been deleted. Schwartz says that MySQL is such a piece of junk that using any of the features traditional RDBMS programmers have relied on for decades will result in the failure of replication, backups, and other critical parts of the system.
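To make the foreign key point concrete, here is a minimal sketch of the deleted-user scenario. It is not taken from the book: it assumes SQLite (via Python's standard sqlite3 module) as a stand-in for the RDBMS, and the users and postings tables and the function names are hypothetical. With the constraint enforced, the database refuses to delete a user who still has postings; with it switched off, the delete succeeds and the posting is left pointing at nobody.

```python
import sqlite3


def build_forum_db(enforce_foreign_keys: bool) -> sqlite3.Connection:
    """Create a tiny in-memory forum schema (hypothetical tables)."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute(
        "PRAGMA foreign_keys = " + ("ON" if enforce_foreign_keys else "OFF")
    )
    conn.execute(
        "CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE postings ("
        " posting_id INTEGER PRIMARY KEY,"
        " user_id INTEGER NOT NULL REFERENCES users(user_id),"
        " body TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO users (user_id, name) VALUES (1, 'alice')")
    conn.execute("INSERT INTO postings (user_id, body) VALUES (1, 'first post')")
    conn.commit()
    return conn


def orphan_postings_after_delete(enforce_foreign_keys: bool) -> int:
    """Try to delete the user, then count postings whose author is gone."""
    conn = build_forum_db(enforce_foreign_keys)
    try:
        conn.execute("DELETE FROM users WHERE user_id = 1")
        conn.commit()
    except sqlite3.IntegrityError:
        # The constraint rejected the delete; the user row is still there.
        conn.rollback()
    (orphans,) = conn.execute(
        "SELECT COUNT(*) FROM postings p"
        " LEFT JOIN users u ON u.user_id = p.user_id"
        " WHERE u.user_id IS NULL"
    ).fetchone()
    conn.close()
    return orphans


if __name__ == "__main__":
    print("orphans with constraint:   ", orphan_postings_after_delete(True))
    print("orphans without constraint:", orphan_postings_after_delete(False))
```

The constraint costs one clause in the table definition; the alternative is for every piece of application code that deletes a user to remember to clean up after itself.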
How about manipulating data inside MySQL? That's also a nicht-nicht. Schwartz says "use sed, awk, sort, and uniq". How about relying on the RDBMS's extensive in-memory cache of table information? Schwartz says don't talk to MySQL too often because it might fall over; start out with Memcached in front. Finally, he notes that most publishers can accept some data loss and inconsistency. It is perhaps a blessing that E.F. Codd died before he could read this chapter.
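For readers who have not seen the "cache in front" arrangement, the following is a rough sketch of the usual read-through pattern, under loud assumptions: a plain Python dict stands in for Memcached, SQLite stands in for MySQL, and the class, table, and key names are all hypothetical rather than anything from the chapter. The application checks the cache first, goes to the database only on a miss, and has to remember to invalidate the cached entry whenever it writes.

```python
import sqlite3


def make_user_db() -> sqlite3.Connection:
    """Tiny in-memory database with one hypothetical users table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (user_id, name) VALUES (1, 'alice')")
    conn.commit()
    return conn


class CachedUserNames:
    """Read-through cache for user names; the dict stands in for Memcached."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.cache = {}      # assumption: a real deployment would use Memcached
        self.db_reads = 0    # counts how often we actually hit the database

    def get(self, user_id: int):
        key = f"user_name:{user_id}"
        if key in self.cache:
            return self.cache[key]          # cache hit: no database work
        self.db_reads += 1
        row = self.conn.execute(
            "SELECT name FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()
        name = row[0] if row else None
        if name is not None:
            self.cache[key] = name          # populate the cache on a miss
        return name

    def rename(self, user_id: int, new_name: str) -> None:
        self.conn.execute(
            "UPDATE users SET name = ? WHERE user_id = ?", (new_name, user_id)
        )
        self.conn.commit()
        # Forgetting this line is how the cache and the database drift apart.
        self.cache.pop(f"user_name:{user_id}", None)


if __name__ == "__main__":
    names = CachedUserNames(make_user_db())
    names.get(1)
    names.get(1)
    print("database reads after two lookups:", names.db_reads)
```

The invalidation step in rename is the part the database's own cache would otherwise handle for free, which is one source of the inconsistency that publishers are being asked to accept.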
Heaping more abuse on the ashes of everything that RDBMS programmers have held sacred is a chapter by Eric Florenzano suggesting the use of key/value pair databases such as Berkeley DB and its descendants. Giving up SQL and the ACID guarantees is often worth it for distributed storage and processing.
The 300-page book raises more questions than it answers, but I think it is important because these are important questions for most Internet application developers and maintainers to ask.
More: read the book.
This book contains nearly 100 figures and the reader will wish to skip/skim some chapters as well as flip back and forth. As such, it is better-suited to hardcopy than an electronic reader.
It's amazing to me how many CompSci graduates have not taken a DBMS course, or if they have still can't accurately describe the different types of Joins and when they would be used. Forget about a networking / security class.
It's equally amazing that in 2010, it is still really hard to build a reliable, secure, flexible, easy to manage web site.
-- David Wihl, July 12, 2010
It seems to me that frustration with MySQL is a part of this move towards NoSQL data stores. Adding to this is the popularity of the Ruby on Rails framework which happily abstracts the data store and applies its own relational algebra to help developers to completely ignore "relational" (and transactions too). I once attended a Rails talk where the speaker said they had to switch to CouchDB because the database couldn't handle and was not designed for the high frequency queries his application was executing. I imagined this was in part because the application code needed to use the data dictionary to figure out what the data store looked like - on every call. It's sort of implied by the name of the DB object library: "Active Record".
One thing I can say about a thick database approach is that your application will weather all the GUI language fads. Maybe that should be in a chapter about ongoing maintenance.
Thanks for the review. I now know what not to buy.
-- Greg Jarmiolowski, July 12, 2010
I recently created a client's website on App Engine (Java). It's worked out well. It's a very basic CMS for a boutique luxury holiday home agency. Not having SQL does make for the need to think outside of the box. But if you can do that, you have a free hosting service for a small dynamic website.
The only reason I would hesitate to build my next big .com on it is that it is very hard to move away from should you want or need to.
The lack of SQL skills frustrates me enormously. I see so many dead-slow designs relying on caching to have any performance at all, simply because to get an index of articles with author name, number of comments and the first image (if any) attached to the article, the average ivy-league CS grad wouldn't know what an outer join on a derived table was if it hit them in the face. Instead they do recursive queries or duplicate (de-normalize) data.
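As a rough illustration of the query described above (a sketch, not the commenter's code), the article index can be produced in one statement with outer joins on derived tables. The assumptions are flagged here and in the comments: SQLite via Python's sqlite3 module stands in for the RDBMS, the schema and names are hypothetical, and "first image" is taken to mean the attached image with the lowest image_id.

```python
import sqlite3


def make_article_db() -> sqlite3.Connection:
    """Hypothetical schema: authors, articles, comments, images."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE authors  (author_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE articles (article_id INTEGER PRIMARY KEY,
                               author_id INTEGER REFERENCES authors(author_id),
                               title TEXT);
        CREATE TABLE comments (comment_id INTEGER PRIMARY KEY,
                               article_id INTEGER REFERENCES articles(article_id));
        CREATE TABLE images   (image_id INTEGER PRIMARY KEY,
                               article_id INTEGER REFERENCES articles(article_id),
                               url TEXT);
        INSERT INTO authors  VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO articles VALUES (1, 1, 'Joins'), (2, 2, 'Caching');
        INSERT INTO comments (article_id) VALUES (1), (1);
        INSERT INTO images (article_id, url) VALUES (1, 'a.png'), (1, 'b.png');
        """
    )
    return conn


ARTICLE_INDEX_SQL = """
SELECT a.title,
       au.name,
       COALESCE(c.n_comments, 0) AS n_comments,
       i.url                     AS first_image_url
FROM articles a
JOIN authors au ON au.author_id = a.author_id
-- Derived table: one row per article that has comments.
LEFT OUTER JOIN (SELECT article_id, COUNT(*) AS n_comments
                 FROM comments
                 GROUP BY article_id) c
       ON c.article_id = a.article_id
-- Derived table: lowest image_id per article (assumed to mean "first").
LEFT OUTER JOIN (SELECT article_id, MIN(image_id) AS first_image_id
                 FROM images
                 GROUP BY article_id) fi
       ON fi.article_id = a.article_id
LEFT OUTER JOIN images i ON i.image_id = fi.first_image_id
ORDER BY a.article_id
"""


def article_index(conn: sqlite3.Connection):
    """Return (title, author, comment count, first image URL or None) rows."""
    return conn.execute(ARTICLE_INDEX_SQL).fetchall()


if __name__ == "__main__":
    for row in article_index(make_article_db()):
        print(row)
```

The outer joins keep articles that have no comments or no images in the result, so the page needs one query instead of one query per article.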
There comes a point (Amazon, Facebook, eBay) when one big RDBMS doesn't make a whole lot of sense anymore as it just won't cope. For the rest of the world's websites, it is still the best and easiest solution. Just take an afternoon to read a book on the subject.
And why MySQL is still so popular is simply beyond me. PostgreSQL runs rings around it in performance, is standards compliant and far more "free" than MySQL will ever be. Both free as in beer and free as in speech.
-- Bas Scheffers, July 12, 2010
@Bas Scheffers: MySQL still gets more usage than PostgreSQL because its discoverability, GUI variety, and command line usability are better. Having used both, I can tell you that I have plenty of days where I wish psql had a better command interpreter, help system, etc. Take a look at bpython, apply that thinking to psql. People will come. Unfortunately, product management is not open source's strong point. MySQL also has better GUI tools in general. I'm still waiting for some enterprising kid to make a version of Sequel Pro that talks to Postgres. For extra credit, de-bind it from Mac OS X and make it run on Linux. But don't, whatever you do, make it with WX, which IMHO is what makes PGAdmin so unbelievably bad and tedious to use, despite its power with graphical query analysis.
@philip: I've run a number of systems on Google App Engine, EC2, and Heroku. There are so many now I can't keep track of them. For my money, I think systems like Heroku will win. The ability to abstract away much of the sysadmin and devops burden, combined with clean bindings to git for both public (open source) and private (proprietary) use, in addition to rock solid reliability on top of EC2, and a plethora of choices in terms of add-ons such as database, security, queuing, measurement, testing, etc. make it a no brainer. Whether trading the sysadmin and devops time/cost is worth the price of admission to the cloud is another matter that should probably be studied, but it is not easy to measure. All I know is, experientially, I have more fun and less hassle developing on Heroku than any other platform I've used, and we run without real sysadmins or dbas, or anything like that, so in some regards, Heroku is the robot that's bringing the second wave of job elimination to the US. My advice? Have a semester or two of class/seminar that deals with dev ops: sysadmin, cloud, why tools like GitHub matter, test automation, how to think like a product manager; because the kids we hire will be expected to do all of this stuff, not just a narrow slice of it. What buoys my spirit is that I'm increasingly seeing a recognition in job listings that people understand the need for generalists, and that's OK, where it used to be tantamount to admitting you were a second class citizen.
-- David Watson, January 27, 2012