A tale of two servers…
Server #1: I attended a presentation by a guy with a background in law and business, what we programmers commonly call “an idiot”. He had set out to build a site where people could type in the books, DVDs, and video games that they owned and were willing to swap with others. Because he didn’t know anything about programming, he stupidly listened to Microsoft and did whatever they told him to do. He ended up with a SQL Server machine equipped with 64 GB of RAM and a handful of front-end machines running ASP.NET pages that he and a few other mediocrities cobbled together as best they could. The site, swaptree.com, handles its current membership of more than 400,000 users (growing rapidly) without difficulty. A typical query looks like “In exchange for this copy of Harry Potter, what are the 400,000 other users willing to trade? And then for each of those things, what else might some third user (chosen from the 400,000) give in a three-way trade?” This kind of query is done many times a second and may return more than 20,000 rows.
Server #2: A non-technical friend hired an MIT-educated software engineer with 20 years of experience to build an innovative shopping site, presenting Amazon-style pages with thumbnails and product descriptions. Let’s call my friend’s site mitgenius.com. The programmer, being way smarter than the swaptree idiot, decided to use Ruby on Rails, the latest and greatest Web development tool. As only a fool would use obsolete systems such as SQL Server or Oracle, our brilliant programmer chose MySQL. What about hosting? A moron might have said “this is a simple site just crawling its way out of prototype stage; I’ll buy a server from Dell and park it in my basement with a Verizon FiOS line going out.” An MIT genius, though, would immediately recognize the lack of scalability and reliability inherent in this approach.
How do you get scale and reliability? Start by virtualizing everything. The database server should be a virtual “slice” of a physical machine, without direct access to memory or disk, the two resources that dumb old database administrators thought that a database management system needed. Ruby and Rails should run in some virtual “slices” too, restricted maybe to 500 MB or 800 MB of RAM. More users? Add some more slices! The cost for all of this hosting wizardry at an expert Ruby on Rails shop? $1100 per month.
For the last six months, my friend and his programmer have been trying to figure out why their site is so slow. It could take literally 5 minutes to load a user page. Updates to the database were proceeding at one every several seconds. Was the site heavily loaded? About one user every 10 minutes.
I began emailing the sysadmins of the slices. How big was the MySQL database? How big were the thumbnail images? It turned out that the database was about 2.5 GB and the thumbnails and other stuff on disk worked out to 10 GB. The servers were thrashing constantly and every database request went to disk. I asked “How could this ever have worked?” The database “slice” had only 5 GB of RAM. It was shared with a bunch of other sites, all of which were more popular than mitgenius.com. Presumably the database cache would be populated with pages from those other sites’ tables because they were more frequently accessed.
How could a “slice” with 800 MB of RAM run out of memory and start swapping when all it was trying to do was run an HTTP server and a scripting language interpreter? Only a dinosaur would use SQL as a query language. Much better to pull entire tables into Ruby, the most beautiful computer language ever designed, and filter down to the desired rows using Ruby and its “ActiveRecord” facility.
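For readers who have not seen the pattern, here is a minimal sketch of the difference, written in Python with an in-memory SQLite database rather than Ruby and MySQL. The `products` table, its columns, and both function names are hypothetical; I never saw the actual mitgenius.com code. The first function drags every row into the application and filters there; the second lets the database do the filtering and returns only the rows that are wanted.

```python
import sqlite3


def make_db(n_products=1000):
    """Build a throwaway in-memory database with a hypothetical products table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, name TEXT)"
    )
    conn.executemany(
        "INSERT INTO products (id, category, name) VALUES (?, ?, ?)",
        [
            (i, "books" if i % 10 == 0 else "other", f"product-{i}")
            for i in range(n_products)
        ],
    )
    conn.execute("CREATE INDEX idx_products_category ON products (category)")
    return conn


def filter_in_app(conn, category):
    """The memory-hungry way: load the whole table, then filter in the application."""
    rows = conn.execute("SELECT id, category, name FROM products").fetchall()
    return [row for row in rows if row[1] == category]


def filter_in_db(conn, category):
    """The dinosaur way: ask the database for only the rows that match."""
    return conn.execute(
        "SELECT id, category, name FROM products WHERE category = ? ORDER BY id",
        (category,),
    ).fetchall()


if __name__ == "__main__":
    conn = make_db()
    print(len(filter_in_app(conn, "books")), len(filter_in_db(conn, "books")))
```

Both functions return the same rows. The difference is that the first one holds the entire table in the application's memory while it works, which is a plausible way for an 800 MB slice to run out of RAM.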
Not helping matters was the fact that the sysadmins found some public pages that went into MySQL 1500 times with 1500 separate queries (instead of one query returning 1500 rows).
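A sketch of what the sysadmins were describing, again in Python with SQLite standing in for MySQL. The `thumbnails` table and the function names are made up for illustration. Both functions return the same 1500 paths; one makes 1500 trips to the database and the other makes one.

```python
import sqlite3


def make_db(n=1500):
    """Build a throwaway in-memory database with a hypothetical thumbnails table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE thumbnails (product_id INTEGER PRIMARY KEY, path TEXT)"
    )
    conn.executemany(
        "INSERT INTO thumbnails (product_id, path) VALUES (?, ?)",
        [(i, f"/thumbs/{i}.jpg") for i in range(n)],
    )
    return conn


def fetch_one_by_one(conn, ids):
    """One query per product: 1500 products means 1500 round trips."""
    queries = 0
    paths = []
    for product_id in ids:
        row = conn.execute(
            "SELECT path FROM thumbnails WHERE product_id = ?", (product_id,)
        ).fetchone()
        queries += 1
        paths.append(row[0])
    return paths, queries


def fetch_in_one_query(conn, lo, hi):
    """One query returning every row in the range: a single round trip."""
    rows = conn.execute(
        "SELECT path FROM thumbnails WHERE product_id BETWEEN ? AND ? "
        "ORDER BY product_id",
        (lo, hi),
    ).fetchall()
    return [row[0] for row in rows], 1


if __name__ == "__main__":
    conn = make_db()
    slow_paths, slow_queries = fetch_one_by_one(conn, range(1500))
    fast_paths, fast_queries = fetch_in_one_query(conn, 0, 1499)
    print(slow_queries, fast_queries, slow_paths == fast_paths)
```

With an in-memory database the extra trips cost almost nothing. Against a database on another virtual slice, where every request goes to disk, each of those 1500 queries pays the full price.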
In reviewing email traffic, I noticed much discussion of “mongrels” being restarted. I never did figure out what those were for.
As the MIT-trained software engineer had never produced any design documentation, I could not criticize his system design. However, I suggested naively that a site with 12.5 GB of data required to produce customer pages (2.5 GB of database plus 10 GB of thumbnails and other files) would need a server with at least 12.5 GB of RAM ($500 retail for the DIMMs?). In the event that different customers wanted to look at different categories of products, it would not be sufficient to have clever indices or optimized queries. Every time the server needed to go to disk it was going to be roughly 100,000 times slower than pulling data from RAM.
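The caveman arithmetic can be written down in a few lines of Python. The sizes come from the sysadmins' emails and the 100,000x penalty is my own round number from above; the miss rate in the example and the function names are assumptions for illustration, not measurements from the site.

```python
# Figures quoted in this post.
DB_GB = 2.5              # MySQL database
FILES_GB = 10.0          # thumbnails and other files on disk
SLICE_RAM_GB = 5.0       # RAM in the shared database slice
DISK_PENALTY = 100_000   # assumed cost of a disk access relative to a RAM access


def working_set_gb():
    """Total data needed to produce customer pages."""
    return DB_GB + FILES_GB


def fits_in_ram(ram_gb):
    """True if the whole working set could be held in RAM at once."""
    return ram_gb >= working_set_gb()


def average_cost(miss_rate, penalty=DISK_PENALTY):
    """Average access cost relative to RAM when a fraction of requests go to disk."""
    return (1.0 - miss_rate) * 1.0 + miss_rate * penalty


if __name__ == "__main__":
    print("working set (GB):", working_set_gb())
    print("fits in the 5 GB slice:", fits_in_ram(SLICE_RAM_GB))
    print("fits in 32 GB:", fits_in_ram(32))
    print("relative cost at a 1% miss rate:", average_cost(0.01))
```

Under these assumptions, even if only one request in a hundred has to go to disk, the average access is about a thousand times slower than serving everything from RAM. When nearly every request goes to disk, as on the slices, the average approaches the full penalty.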
My Caveman Oracle/Lisp programmer solution: a 2U Dell server with 32 GB of RAM and two mirrored disk drives. No virtualization. MySQL and Ruby on Rails running as simultaneous processes on the same Linux installation. Configure the system with no swap file so that it will use all of its spare RAM as file system cache (we tore our hair out at photo.net trying to figure out why a Linux machine with 12 GB of RAM, bought specifically to serve JPEGs, would only use about one third for file system cache; it stumped all of the smartest sysadmins and the answer turned out to be removing the swap file). Park it at a local ISP and require that the programmer at least document enough of the system that the ISP’s sysadmin can install it. If usage grows to massive levels, add some front-end machines and migrate the Ruby on Rails processes to those.
What am I missing? To my inexperienced untrained-in-the-ways-of-Ruby mind, it would seem that enough RAM to hold the required data is more important than a “mongrel”. Can it be that simple to rescue my friend’s site?
[August 2009 update: The site has been running for a couple of months on its own cheapo Dell pizza box server with 16 GB of RAM. The performance problems that the Ruby on Rails experts had been chasing for months disappeared and the site is now responsive.]
[March 2010 update: My friend is now in discussions with some large companies interested in using his technology or service. Andrew Grumet and I published “Software Design Review” as a result of this experience.]