I've been reading and re-reading this white-hot article about Google's technical infrastructure by Rich Skrenta. He makes a few key points, which I have felt free to embellish and otherwise distort:
- Google can afford to do -- and seems to be doing -- Bell Labs-style "pure" CS R&D to actually expand upon and in some cases challenge the 1970s technical state of the art in storage, OS, and server architecture that has defined most large-scale Web applications;
- Google has staked its success on seemingly minor incremental improvements, like highlighting your search terms in the results summary, that nevertheless require orders-of-magnitude increases in processing, memory, or storage power;
- Google has constructed an impressive computing platform to accomplish this, which is essentially one massive computer;
- The critical differentiator for this computing platform is its ability to integrate hundreds of highly unreliable component computers and thus add CPU cycles cheaply.
To illustrate this last point, Skrenta, who used to work at AOL/Netscape, tells an interesting story:
"In a previous job I specified 40 moderately-priced servers to run a new internet search site we were developing. The ops team overrode me; they wanted 6 more expensive servers, since they said it would be easier to manage 6 machines than 40.
"What this does is raise the cost of a CPU second. We had engineers that could imagine algorithms that would give marginally better search results, but if the algorithm was 10 times slower than the current code, ops would have to add 10X the number of machines to the datacenter. If you've already got $20 million invested in a modest collection of Suns, going 10X to run some fancier code is not an option.
"Google has 100,000 servers."
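Skrenta's "cost of a CPU second" argument can be sketched with a back-of-the-envelope calculation. The server prices and amortization period below are invented for illustration; only the 40-vs-6 server counts come from his story:

```python
# Hypothetical comparison of the cost of a CPU-second under the two
# provisioning strategies Skrenta describes. All dollar figures and the
# 3-year amortization are illustrative assumptions, not from the article.

def cost_per_cpu_second(total_cost_dollars, num_cpus, lifetime_years=3):
    """Amortize hardware cost over its useful life to get $/CPU-second."""
    seconds = lifetime_years * 365 * 24 * 3600
    return total_cost_dollars / (num_cpus * seconds)

# 40 moderately priced servers at a hypothetical $3,000 each
cheap = cost_per_cpu_second(40 * 3_000, num_cpus=40)

# 6 expensive servers at a hypothetical $100,000 each
expensive = cost_per_cpu_second(6 * 100_000, num_cpus=6)

print(f"cheap fleet:     ${cheap:.2e} per CPU-second")
print(f"expensive fleet: ${expensive:.2e} per CPU-second")
print(f"ratio: {expensive / cheap:.0f}x")
```

Under these made-up prices a CPU-second on the expensive fleet costs about 33 times more, which is exactly why a 10x-slower-but-better algorithm becomes unaffordable on big iron and merely expensive on commodity boxes.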
Google's architecture of many cheap computer boxes running free operating systems seems to be both an extension of trends that were percolating prior to Google (recall the old Slashdot saw "imagine a Beowulf cluster of these ..." and, if you were at the Windows 2000 launch, recall the conference hall full of Dells we were told would soon be powering Hotmail) and at the same time a very visible and clear advantage for the company over its competitors. I'm sure there are system architects looking at what Google has done and thinking of trying something similar (as opposed to buying another Sun or HP box).
My question is: Can the RDBMS as we know it today extend adequately to be used in Web applications distributed across hundreds or thousands of individual CPUs? You write in the Internet application workbook:
"It turns out that the CPU-CPU bandwidth available on typical high-end servers circa 2002 is 100 Gbits/second, which is 100 times faster than the fastest available Gigabit Ethernet, FireWire, and other inexpensive machine-to-machine interconnection technologies.
"Bottom line: if you need more than 1 CPU to run the RDBMS, it usually makes most sense to buy all the CPUs in one physical box."
If people start building more Google-style backends, will they be able to use traditional RDBMSs or will they, like Google, have to start rethinking the viability of this 1970s technology?
Or, to invert the question, will Google be able to use its platform as it transitions from infrequently updated datasets (Web index, Usenet archive) and occasionally updated datasets (news aggregation, shopping bot) to datasets that are both complex and mutating in real time (e-mail, social networking systems, weblogs)? If so, do you have any clue how Google might create an RDBMS-style system for its platform?
-- R Tate, April 8, 2004
A profound question. The very first system to look at links among Web sites was developed by Ellen Spertus, then a graduate student at MIT but doing her research at the University of Washington because nobody at MIT understood what an RDBMS was. She made it possible to use SQL to ask questions such as "show me all the sites that link to http://www.photo.net" or "show me all the sites that are linked to by at least 10 other sites". So in some sense Google has a heritage in an RDBMS-based system.
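Spertus's two questions map naturally onto a single links table. A minimal sketch in SQLite (the schema and the sample rows are invented for illustration; this is not her actual system):

```python
import sqlite3

# Hypothetical schema: one row per hyperlink, from_url -> to_url.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (from_url TEXT, to_url TEXT)")
conn.executemany("INSERT INTO links VALUES (?, ?)", [
    ("http://a.example", "http://www.photo.net"),
    ("http://b.example", "http://www.photo.net"),
    ("http://a.example", "http://c.example"),
])

# "Show me all the sites that link to http://www.photo.net"
inbound = conn.execute(
    "SELECT DISTINCT from_url FROM links WHERE to_url = ?",
    ("http://www.photo.net",)).fetchall()

# "Show me all the sites that are linked to by at least N other sites"
# (N = 2 here to match the toy data; Spertus's example used 10)
popular = conn.execute("""
    SELECT to_url, COUNT(DISTINCT from_url) AS n
    FROM links
    GROUP BY to_url
    HAVING COUNT(DISTINCT from_url) >= 2
""").fetchall()

print(inbound)   # the two sites linking to photo.net
print(popular)   # photo.net, with its inbound-link count
```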
The RDBMS was developed for precious data. Google and similar massive Internet systems deal in non-precious data. Google need not care if they lose thousands of updates from an evening's crawling or if some of their servers are several updates behind and give slightly different results than their most up-to-date server. So they're free to do a lot of stuff that people building a bank transaction processing system aren't free to do.
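The "non-precious data" point can be made concrete with a toy model of asynchronously updated search replicas. This is emphatically not Google's design, just an illustration of the tradeoff: a lagging replica gives slightly stale answers, and nobody is harmed.

```python
# Toy illustration (not Google's actual architecture) of tolerating
# stale reads: replicas apply queued index updates asynchronously, so
# a slow replica can be several updates behind and still serve queries.

class Replica:
    def __init__(self):
        self.index = {}     # term -> list of matching documents
        self.pending = []   # updates received but not yet applied

    def enqueue(self, term, doc):
        self.pending.append((term, doc))

    def apply_some(self, n):
        """Apply up to n queued updates; a slow replica applies fewer."""
        for term, doc in self.pending[:n]:
            self.index.setdefault(term, []).append(doc)
        del self.pending[:n]

    def search(self, term):
        return self.index.get(term, [])

updates = [("camera", f"doc{i}") for i in range(10)]
fast, slow = Replica(), Replica()
for replica in (fast, slow):
    for term, doc in updates:
        replica.enqueue(term, doc)

fast.apply_some(10)  # fully caught up
slow.apply_some(7)   # several updates behind

print(len(fast.search("camera")))  # 10 results
print(len(slow.search("camera")))  # 7 results -- slightly stale, and that's fine
```

A bank could never ship this; a search engine can, and that freedom is worth a great deal of hardware.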
The RDBMS is also all about making it easy and reasonably efficient to ask new and unanticipated questions. Most Web applications have a very constrained interface and therefore a very limited number of questions that can be asked. So again it would be very wasteful to use an RDBMS in a performance-critical server farm such as Google.
The RDBMS is all about making sure that average quality programmers on tight schedules don't make terrible mistakes in managing concurrency. An organization with brilliant programmers and longer development schedules might be able to manage concurrent updates at a much lower cost in performance and hardware.
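The concurrency mistake the RDBMS guards against is the classic lost update. A toy interleaving, written out sequentially so the hazard is visible (no real database or threads involved):

```python
# Two clients each read a counter, increment a private copy, and write
# back. Interleaved this way, one increment is silently lost -- the
# anomaly an RDBMS transaction (or a SELECT ... FOR UPDATE) prevents
# without the programmer having to think about it.

balance = {"hits": 100}

# Client A and client B both read the current value...
a_copy = balance["hits"]
b_copy = balance["hits"]

# ...each increments its private copy...
a_copy += 1
b_copy += 1

# ...and each writes back. B's write clobbers A's.
balance["hits"] = a_copy
balance["hits"] = b_copy

print(balance["hits"])  # 101, not 102: one update was lost
```

Brilliant programmers can avoid this by hand with careful locking or versioning; the RDBMS's value is that average programmers on tight schedules don't have to.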
It might be a mistake to look at the most challenging IT problems as generic examples. If you said "I'm not going to build an accounting system unless it can solve all the problems faced by General Motors" you'd never build QuickBooks. It would also be a mistake for most people to say "My computation problems are tough so I want to get the same setup as those IBM genetic researchers or Google."
-- Philip Greenspun, April 9, 2004