RDBMS versus NoSQL article drafted

Folks:

We’ve drafted an RDBMS versus NoSQL article for the three-day RDBMS course that we’re teaching this weekend at MIT (you’re welcome to take the class, even if you’re not an MIT student; just email me). The article is intended to support a 20-minute discussion and give students pointers should they be interested in learning more. As neither Andrew nor I are NoSQL experts, comments would be appreciated.

Thanks.

18 thoughts on “RDBMS versus NoSQL article drafted”

  1. “288,000 users per day (assuming that the load is smooth, which is probably reasonable, so divide by two”
    Don’t you mean “unreasonable”?

    Other than that, I’d possibly focus some more on how to make the RDBMS go further before requiring an “all NoSQL” solution. Right now that is spread out in the last paragraph of “Global Reach” and the memcached description.

  2. As a long-time reader of all of your books (multiple times!), here are some thoughts:

    1) In previous writings you would reference vertical scaling, such as what eBay did with multi-CPU Sun servers. While you allude to the cheap $20k server option for scaling, you don’t say anything about current vertical scaling options. If you poke your head around enough startup CTO blogs (not from the 5 big sites with $500MM+ in VC, but all the little ones), you’ll notice a common thread: “We’re experimenting with XYZ NoSQL solution combined with these 3 different programming languages and this cool query language we found”. The minute someone invests money in these guys, someone has to address the question of how far $500k-1MM+ spent on ONE database server would get the company. It costs the same as or less than paying a bunch of engineers who have never used a given beta-grade NoSQL system to spend a year figuring it out, but there’s much less risk and the benefit is immediate. I suspect that a lot of people’s 40 EC2 instances could be rolled down into 1 mega-big DB server, some memcached servers, and web servers. The time saved can be used for adding features or refining the UI/UX for various conversion factors.

    2) If you look at most of the “How’d We Get Here” presentations for popular sites like Facebook, WordPress, and Flickr, you see a common trend: they start with 1 server, then split DB/Web, then do a bunch of Web, etc. Then they start doing MySQL replication. Only after 50-100 deployed machines did they start to do different things. I think for some of them (FB) it may have been closer to 100s or 1000s. Perhaps it is important to have a section on the ‘Evolution’ of a project’s database architecture.

    3) I get the sense that a lot of people are using the NoSQL approaches for doing what is essentially offline data mining / transforming. It would be useful to point out that one should not confuse the needs of the data mining system with those of the OLTP system. Perhaps many projects would be fine with traditional RDBMS architectures and some EC2 compute clusters for random needs.

    4) What I liked about some of your previous writing re: Object Stores was when you pointed out the perceived lack of proven quality in these systems vs. RDBMSes, which had a much bigger market share.

    5) HandlerSocket allows you to use MySQL more like a NoSQL solution and in fact claims that it is faster than some NoSQL solutions. The benefit is that you can run this directly on your normal data and use all of the normal SQL functionality of MySQL on the data (when you’re not in ‘NoSQL mode’). I am not sure why solutions like this are not more popular, but I hope they become so. (see http://mysqldba.blogspot.com/2010/12/handlersocket-mysqls-nosql-php-and.html )

    6) It may not be clear to the reader from your description of HyperDB that it is a database abstraction *class* that does fancy things to table names and database connections, rather than being an actual database. I believe LiveJournal used to do something similar with Perl.

    7) One side effect of the push to NoSQL products is that people are left with no way to query their datasets. That’s why when you start exploring the offerings available in this market you’ll find obscure, half-implemented query languages that would effectively provide the same power as SQL if they were mature. There are a lot of these at the scripting-language library level (Python/Ruby).

    8) It would be helpful (time permitting) to update one of your architecture graphics from previous writings to show how something like memcached fits into the picture. Your explanation is clear enough, but something visual would probably help people not experienced with it understand the state flow between pulling a cached data object from memcached vs. having the same data object first refreshed from SQL. (A rough code sketch of that read-through flow follows this list.)

    9) You might explore the web site “deployment” solutions/fads that have contributed to the popularity of some of the NoSQL approaches. For instance it is hard to scale vertically with virtualized cloud service providers such as Amazon’s EC2 because they don’t let you configure your own virtual hardware spec. You must choose amongst a small selection of machine sizes. While EC2 delivers enormous flexibility, this is one shortcoming. I believe that many people have decided to take EC2’s flexibility and then deal with scalability by going horizontal. Why is this perspective important? It acts as a rate-limiter for people who might try to scale vertically. If you’re on EC2 there is a very clear maximum amount of power you will get out of one machine. Every so often Amazon increases these limits. It’s basically the same with every other cloud provider.
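
    Regarding point 8: here is a minimal sketch of the read-through (“cache-aside”) flow, assuming a memcached daemon on localhost:11211, the Python pymemcache client, and a hypothetical users table reached through a DB-API connection (none of these names come from the article):

      import json
      from pymemcache.client.base import Client

      cache = Client(("localhost", 11211))
      CACHE_TTL = 300  # seconds before the cached copy is considered stale

      def load_user(conn, user_id):
          """Return the user row, from memcached if possible, else from SQL."""
          key = "user:%d" % user_id
          cached = cache.get(key)          # 1. try the cache first
          if cached is not None:
              return json.loads(cached)    # 2. hit: no database work at all
          # 3. miss: fall back to the RDBMS
          cur = conn.cursor()
          cur.execute("SELECT user_id, name, email FROM users WHERE user_id = %s",
                      (user_id,))
          row = cur.fetchone()
          if row is None:
              return None
          user = {"user_id": row[0], "name": row[1], "email": row[2]}
          # 4. refresh the cache so the next request is served from memory
          cache.set(key, json.dumps(user), expire=CACHE_TTL)
          return user

      def update_user_email(conn, user_id, email):
          """Writes go to the RDBMS (the system of record); the cached copy is
          invalidated so the next read repopulates it."""
          cur = conn.cursor()
          cur.execute("UPDATE users SET email = %s WHERE user_id = %s",
                      (email, user_id))
          conn.commit()
          cache.delete("user:%d" % user_id)

    The point of the sketch is the ordering: the cache is consulted first, the database remains the system of record, and writes invalidate rather than update the cached copy.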

  3. The comment I’d like to make regarding NoSQL vs. SQL is how much more productive NoSQL databases are for development. I’ve been using SQL databases for years and I’m currently using CouchDB on a few (web) projects, and it’s the first time that my database layer is as flexible as my presentation layer. In the world of agile development and weekly/daily updates of products, this is a big plus.
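
    A minimal sketch of the flexibility being described, assuming a local CouchDB at http://localhost:5984 and a database named “projects” (both made up for illustration); adding a field is just writing a different JSON document, with no ALTER TABLE step before the presentation layer can use it:

      import requests

      BASE = "http://localhost:5984/projects"

      # Version 1 of the document, as shipped in week 1.
      doc = {"_id": "widget-42", "title": "Widget", "price": 19.95}
      requests.put(BASE + "/widget-42", json=doc)

      # Week 2: the product page now needs tags.  Read, modify, and write back;
      # CouchDB's MVCC requires the current _rev to be included on the update.
      current = requests.get(BASE + "/widget-42").json()
      current["tags"] = ["sale", "blue"]
      requests.put(BASE + "/widget-42", json=current)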

  4. “… since SQL is so widely understood and SQL programs tend to reliable.” Shouldn’t it end with “… tend to be reliable.”?

    The example with Ruby is anecdotal; I’ve seen many developers mess up paginated queries (in different languages and frameworks), and this is not directly linked to how Rails works. You should use it to illustrate a point made more generally. Maybe move this to the ORM section, since that mapping is the source of the problem.
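
    A small sketch of the usual mistake, in plain Python DB-API terms rather than Rails; the articles table and page size are hypothetical. The bug is independent of any particular ORM: the slicing is done in the application instead of in the database:

      import sqlite3

      PAGE_SIZE = 20
      conn = sqlite3.connect("example.db")

      def page_of_articles_slow(page):
          # Anti-pattern: pull the whole table across the wire, then slice in Python.
          rows = conn.execute("SELECT id, title FROM articles ORDER BY id").fetchall()
          start = page * PAGE_SIZE
          return rows[start:start + PAGE_SIZE]

      def page_of_articles(page):
          # Let the database do the limiting; only one page crosses the wire.
          return conn.execute(
              "SELECT id, title FROM articles ORDER BY id LIMIT ? OFFSET ?",
              (PAGE_SIZE, page * PAGE_SIZE),
          ).fetchall()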

    In “Global Reach”, some hosting providers will offer a faster backbone with multiple points of entry, so the distance from the user is less of a factor. See Peer1’s network for an illustration.

    “… but are not part of the “NoSQL”, “NoACID” fad:” is the word “fad” really needed here? Obviously there is a fad, but do you need to point it out in your article?

    You may want to mention HBase among the NoSQL (document database) systems; it is used by Facebook.

    The conclusion is sound for an RDBMS course, even if a little conservative. Why no mention of PostgreSQL and MySQL?

  5. Thanks, everyone, for the great feedback so far. Even if we can’t get all of this into the article, we can certainly put it into the discussion.

    Michael: we can draw some diagrams on the blackboard.

    J-F Noel: Why no mention of MySQL? We’re using it for the class! I’m constantly amazed at its limitations and its silent failure to provide protections you’d expect to get, e.g., CHECK constraints. We’ll probably talk about PostgreSQL right at the beginning, and one of the teachers is a PostgreSQL expert.
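
    For anyone who hasn’t seen the CHECK behavior, here is a small demonstration, assuming a local MySQL 5.x server reachable with the pymysql client (host, credentials, and table name are placeholders). On the MySQL versions current as of this writing, the CHECK clause is parsed and then silently ignored, so the bad row goes in; PostgreSQL rejects the same INSERT:

      import pymysql

      conn = pymysql.connect(host="localhost", user="test", password="test",
                             database="scratch")
      cur = conn.cursor()
      cur.execute("""
          CREATE TABLE line_items (
              item_id  INT PRIMARY KEY,
              quantity INT NOT NULL,
              CHECK (quantity > 0)
          )
      """)
      # MySQL accepts this without error or warning; the "constraint" did nothing.
      cur.execute("INSERT INTO line_items (item_id, quantity) VALUES (1, -5)")
      conn.commit()
      cur.execute("SELECT quantity FROM line_items WHERE item_id = 1")
      print(cur.fetchone())   # prints (-5,)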

  6. I too think the word “fad” is unnecessarily dismissive. Different species of data persistence solutions have evolved for different environmental conditions. Comparing Oracle RAC with CouchDB really is like comparing a beluga and a zebra. The stripes don’t help it hold its breath at all!

  7. I think your assumption about the number of queries per page is low. You mention Ruby on Rails and its propensity to allow careless programmers to generate LOTS of queries, but it seems to me like web apps are doing many more queries per page now, even hand-tuned ones. I’d guess 30 queries per page is closer to the mark than 3.

  8. I think that was an excellent article and I thank you for your contribution to the Great NoSQL Flamewars. I’d call it a debate but the rational debate on the topic seems to have ended somewhere in 2010 and all that’s left is rabid nutbags yelling at each other.

  9. I’d like to see more information about the NoSQL options. AFAICT you are mostly writing about the “advantages” of RDBMSes; there is just a brief mention of OODBs, and for those we don’t see similarly calculated numbers.

    You write that if you look up by identity, OODBMSes are fast, but could they be faster than just looking up one record by the “primary” key?

    You mentioned briefly what may happen with some ORMs. I see this as the biggest problem currently. We do have this mismatch between objects and “relations”, and the ORMs do not cope well with it either, AFAICT. (I see it everywhere: you never really see much actual OO in these examples, just instance variables that are attributes. If you have 1:n relations you may get a collection of other “objects”, but at their core they are just “tables” glorified as objects.)

    A simple example is something that is a non-problem in OO languages: inheritance. Just check the ways of achieving it in an RDBMS and you can see the shortcomings immediately.

    I think this mismatch really is the biggest problem, and so one has to see how to get around it. I’m missing a serious discussion of this problem.

    On the other hand, I think the article treats RDBMSes quite fairly.

  10. Friedrich: Is chasing a pointer in an OODBMS faster than looking up a row by primary key in an RDBMS? It should be. The first operation is O[1] and the second operation, working through a B-tree index, is O[log N] where N is the number of rows in the table.

  11. The primary shortcomings of RDBMSes vs. OODBs are not performance related. That is a secondary effect of bad design. In order to get good performance out of an RDBMS, joins need to be avoided. That leads to duplication and tables that are not sufficiently normalized. In combination with overly simple data type models, that leads to wrong data in most systems.
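
    A tiny illustration of the duplication problem, using Python’s sqlite3 module and made-up tables: when the customer’s city is copied into every order row to avoid a join, updating it in one place leaves stale copies behind:

      import sqlite3

      conn = sqlite3.connect(":memory:")
      conn.executescript("""
          -- Denormalized: the customer's city is repeated on every order.
          CREATE TABLE orders (
              order_id      INTEGER PRIMARY KEY,
              customer_name TEXT,
              customer_city TEXT,
              amount        REAL
          );
          INSERT INTO orders VALUES (1, 'Alice', 'Boston', 10.0);
          INSERT INTO orders VALUES (2, 'Alice', 'Boston', 25.0);
      """)

      # Alice moves; the application updates the city but only touches one order.
      conn.execute("UPDATE orders SET customer_city = 'Cambridge' WHERE order_id = 1")

      # The same customer now appears to live in two cities at once.
      print(conn.execute(
          "SELECT DISTINCT customer_name, customer_city FROM orders").fetchall())
      # two rows for one customer, e.g. [('Alice', 'Cambridge'), ('Alice', 'Boston')]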

  12. @philg: You wrote:
    “The first operation is O[1] and the second operation, working through a B-tree index, is O[log N] …”

    Is it really O(1)? In RAM I would think it to be true, but on disk?
    I just remember something about ObjectStore, where the lookup was IMHO also around O(log N), but I do not really remember….

    Are OODBs really just big hashes? I don’t know. And what about caching? Couldn’t there be a cache of often-used relations which are looked up just once?

    I really would be surprised if the RDB vendors have not thought about or worked on this for ages. So with the high maturity of RDBs I’d expect some quite clever optimizations, especially when it comes to saving “objects”. I can’t tell, e.g., what the facilities of Postgres have to offer in this regard.
    file:///usr/share/doc/postgresql-doc-8.4/html/datatype-oid.html

    And I do not know what, e.g., Oracle has done in the area of using Java together with its databases…

  13. According to C.J. Date, current RDBMS’s have come nowhere near attaining the full potential of the relational model. “The truth is, the relational model has never been properly implemented in commercial form, and users have never really enjoyed the benefits that a truly relational product would bring.” — page 280 of C.J. Date’s “SQL and Relational Theory”. This book has the subtitle “How to Write Accurate SQL Code” but perhaps a better subtitle would be “Why SQL Fails for the Relational Model.”

    C.J. Date has been active in database theory for 40 years — he wrote the textbook I used as a student in the 1970’s. Fifteen years ago he produced “The Third Manifesto” proposing a full implementation of the classic relational model and a family of languages capable of expressing the full model. SQL encourages (and perhaps requires) a table-based implementation that limits performance. Date’s definition of the relational model speaks of scalar types, relation types, relation variables and relation operators, not rows and columns and tables.

    From page 36: “A proper object/relational system is just a relational system with proper type support (including proper user defined type support in particular) — which just means it’s a proper relational system, no more and no less. And what some people are pleased to call “the object/relational model” is, likewise, just the relational model, no more and no less.”

    In other words, he “impedence mismatch” is only between the object-oriented model and SQL, not the object-oriented model and the relational model.

    It’s disappointing that this work of 15 years has not resulted in available products. I don’t know why this is. I would like to see Stonebraker address and criticize Date’s work here, but on the web page you site, where Stonebraker replies to Andrew D. Wolfe, Jr., he seems to be arguing for the end of the tables-with-rows-and-columns era, i.e. the end of the SQL era, with which Date would fully agree.

  14. s/words, he “impedence mismatch”/words, the “impedence mismatch”/

    s/on the web page you site/on the web page you cite/

  15. Friedrich: Is pointer chasing in an OODBMS really O[1] and does it matter whether it is in RAM or on disk? First, if the database being queried doesn’t fit into RAM you’re going to be in a world of hurt for transaction processing. So let’s assume that the working set of the database fits into RAM for either one. My understanding of OODBMS dates from the 1980s, but basically the ones with which I was familiar mapped all of their objects into virtual memory and then used the operating system to keep track of and write back dirty pages. So pointer chasing for reads could be done just the same as if the objects were not persistent, i.e., an O[1] lookup into a random-access memory when you already have the address.

    With an RDBMS, the B-tree index is expected to be in RAM as well, but the program still needs to look up rows by the value of the primary key, which involves going down O[log n] levels in the B-tree.
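
    A back-of-the-envelope sketch of the difference (the fanout figure is an assumption; a few hundred keys per B-tree page is typical, and this is not a measurement of any particular RDBMS):

      import math

      def btree_levels(n_rows, fanout=200):
          """Approximate number of index levels touched to find one row by key."""
          return max(1, math.ceil(math.log(n_rows, fanout)))

      for n in (10**4, 10**6, 10**8, 10**10):
          print("%12d rows -> about %d B-tree levels" % (n, btree_levels(n)))

    The pointer dereference in the OODBMS stays at one memory access regardless of N, while the B-tree grows slowly, from about 2 levels at 10,000 rows to about 5 at 10 billion. That is the whole asymptotic argument; constant factors and caching are where real systems differ.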

  16. Hello philg. I’m a bit short on time, but have looked around a bit. If you like you may visit:
    http://web.progress.com/de-de/objectstore/

    and see what they write.

    I’ll just quote from one of the papers there, “Memory-Centric Data Management”:

    But the most natural implementation for an OODBMS relies on chains of random-access pointers, and that architecture is only performant in a memory-centric implementation. Progress’s ObjectStore boasts some extremely high-performance applications, notably the Amazon.com bookstore – a huge application distributed over a large number of application servers.

    So they write that it is only that performant when it is in memory (which is surely understandable), but there is still the work of getting the data into memory and keeping it consistent, even for applications that can be accessed from anywhere; they still have to look up the information on disk. So we cannot judge the OODB on memory access alone; disk access matters too. The history of RDBs is quite long, and I would think (because it is the center of it all) that fast data access is one of the highest priorities. I do not know what was done in that area, by either RDBSes or OODBs, but I think it matters…
