How many updates per second can a standard RDBMS process?

Folks: This weekend I’m teaching a class at MIT on RDBMS programming. Sadly I forgot to learn the material myself before offering to teach it. In giving students guidance as to when a standard RDBMS is likely to fall over and require investigation of parallel, clustered, distributed, or NoSQL approaches, I’d like to know roughly how many updates per second a standard RDBMS can process. Here are the criteria:

  • mixture of inserts, updates, and selects, with the server doing 10 queries for every update and the typical update or insert involving two rows
  • a 4U server with 8 CPU cores and enough RAM to hold the entire database
  • eight hard drives so that transaction logs, indices, and tables can all reside on separate spindles (or magic RAID approaches can be used)
  • perhaps a little bit of magic in the form of a solid-state drive being used for the transaction logs
  • any RDBMS, including MySQL/InnoDB, Oracle, SQL Server, DB2, etc.

I’m looking for a number that is good to within one order of magnitude.
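To make the workload concrete, here is a rough sketch of one “unit” of the mix described above: ten reads for every write, with the write touching two rows. It assumes PostgreSQL with psycopg2; the connection string and the accounts table are made up. The question is how many of these units per second a standard box can sustain.

```python
# Sketch of the workload mix: 10 SELECTs per write, the write touching
# two rows. Table, columns, and connection string are hypothetical.
import random
import time

import psycopg2

conn = psycopg2.connect("dbname=test")
cur = conn.cursor()

def one_unit_of_work():
    # ten reads...
    for _ in range(10):
        cur.execute("SELECT balance FROM accounts WHERE id = %s",
                    (random.randint(1, 1000000),))
        cur.fetchone()
    # ...and one write transaction touching two rows
    a, b = random.sample(range(1, 1000000), 2)
    cur.execute("UPDATE accounts SET balance = balance - 1 WHERE id = %s", (a,))
    cur.execute("UPDATE accounts SET balance = balance + 1 WHERE id = %s", (b,))
    conn.commit()

start, n = time.time(), 0
while time.time() - start < 60:   # run for one minute
    one_unit_of_work()
    n += 1
print("units/second: %.0f" % (n / (time.time() - start)))
```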

Thanks.

[http://www.tpc.org/tpce/results/tpce_perf_results.asp makes it look as though the answer is “about 2000”, though I’m not sure whether the TPC-E benchmark simultaneously requires some queries. It is tough to say because the tpc.org Web site is unusably slow…. I finally got the Word doc. It looks like they count some queries as “transactions”, e.g., “The [Customer-Position] transaction is a read-only transaction”. So maybe the answer is closer to 1000? The TPC-E benchmarks use some ridiculously expensive systems, but they also require a huge amount of data to be kept, e.g., 8 TB for a low-end system doing 2000 queries and updates/second.]

(The venerable TPC-C standard never struck me as matching up that well to typical Internet applications.)

10 thoughts on “How many updates per second can a standard RDBMS process?”

  1. This question can be answered by Kingsley Idehen, CEO of OpenLink Software in Burlington, MA, the foremost expert in relational and SQL DBs at scale. I have put the question to him and will get back here when he pipes up.

    Of course, things have changed in the business and what we used to expect is no longer considered acceptable.

  2. About a year ago I benchmarked PostgreSQL 8.3 and Oracle 10g on a pair of HP blade servers with 2x 2.4GHz quad-core Opterons and 8GB each, and one 320GB FusionIO PCIe SSD module each. The list price of the SSDs, $10,000+ each, was much higher than the rest of the systems put together.

    One node was primary, the second backup, using Skytools walmgr for PostgreSQL and DataGuard for Oracle. The query mix was 50-50 read-write, and 50-50 using SSD or SSD+HDD tablespaces. The queries were simpler than the ones you describe, but still representative of our production workload. We had stringent SLAs in terms of latency, 30ms max response time at the 95th percentile.

    PostgreSQL handled roughly 16,000 transactions per second, with 0.9ms latency at 95th percentile, and Oracle was comparable (Oracle licensing does not permit disclosing benchmark results).
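    (For anyone checking their own numbers: the 95th-percentile figure is simply the latency below which 95% of the transactions complete. A minimal sketch of computing it from recorded per-transaction timings follows; this is not the harness we actually used.)

    ```python
    # Sketch: record per-transaction latencies, then report the 95th percentile.
    import time

    def timed_ms(fn, *args):
        """Run fn and return its wall-clock latency in milliseconds."""
        t0 = time.perf_counter()
        fn(*args)
        return (time.perf_counter() - t0) * 1000.0

    def percentile(latencies_ms, pct=95.0):
        ordered = sorted(latencies_ms)
        k = max(0, int(round(pct / 100.0 * len(ordered))) - 1)
        return ordered[k]

    # latencies = [timed_ms(run_one_transaction) for _ in range(100000)]
    # print("p95 = %.2f ms" % percentile(latencies))
    ```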

    Most NoSQL databases are immature for operational deployment, and have serious functional compromises. They can be appropriate for specific operations, but with fast SSDs, I don’t think they make sense for any but the largest Web 2.0 shops.

  3. Thanks, Fazal. When you say “16,000 transactions/second” you’re also including the reads, correct? So 8,000 updates/second? That’s much faster than I would have expected with HDD.

    When you did the benchmark did you have a lot of different simultaneously connected clients or just a couple of clients feeding the database a never-ending massive file of SQL?

    And why did you need such big SSDs? Wouldn’t it be adequate to have ones just large enough to hold a few redo logs and then let the Oracle archiver copy them off to a HDD?

  4. Yes, 8000 tps read-only and 8000 tps read-write.

    The test was run using Grinder, a multithreaded Java load-testing framework, on a machine as powerful as the DB servers, with about 200 simultaneous connections.

    The proof of concept was for a web service that would be called by websites during the content-rendering loop, i.e., the transactions, along with Internet round-trip time and app-server processing, would be in the critical path for rendering a website, and market feedback indicated that 500ms of total extra page-load time was the most our clients were prepared to tolerate. When you deduct two round trips at an estimated ping time of 160ms each (500ms - 2 × 160ms), that only leaves 180ms to serve the request. The budget for the database itself was 30ms, over a 6TB data working set, sharded across multiple instances. Not all data needed to be stored on SSD, only some critical tables, with most of the activity log and analytics (about 200TB total) perfectly happy with HDD speeds.

    A big part of the test was to find out if there were bottlenecks in the DBMS engines that could cause priority inversion, with all-SSD transactions waiting for HDD transactions to complete before they could acquire a lock. Oracle and PostgreSQL passed the test, but MySQL failed miserably, with some transactions taking over 100ms at 95th percentile (the FusionIO latency itself is 0.2ms, so the bottlenecks were in the RDBMS).
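    (If anyone wants to reproduce the SSD/HDD split, on the PostgreSQL side it is just tablespaces; a sketch with made-up mount points and table names follows. The priority-inversion check then amounts to running SSD-only and SSD+HDD transactions concurrently and comparing their 95th-percentile latencies.)

    ```python
    # Sketch of an SSD/HDD tablespace split in PostgreSQL via psycopg2.
    # Mount points and table names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    conn.autocommit = True   # CREATE TABLESPACE cannot run inside a transaction
    cur = conn.cursor()

    cur.execute("CREATE TABLESPACE ssd_space LOCATION '/mnt/fusionio/pgdata'")
    cur.execute("CREATE TABLESPACE hdd_space LOCATION '/mnt/raid/pgdata'")

    # hot, latency-critical table goes on the SSD tablespace
    cur.execute("""
        CREATE TABLE hot_profile (
            id      bigint PRIMARY KEY,
            payload text
        ) TABLESPACE ssd_space
    """)

    # the bulky activity log is fine at HDD speeds
    cur.execute("""
        CREATE TABLE activity_log (
            id      bigserial PRIMARY KEY,
            ts      timestamptz DEFAULT now(),
            payload text
        ) TABLESPACE hdd_space
    """)
    ```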

    These latency SLAs may be tight by web standards, but in the financial industry, an extra millisecond of latency saved would gain a medium-sized brokerage an estimated $50M per year in extra high-frequency trading profits. The SLAs for electronic trading networks are expressed in microseconds (Island ECN guarantees 400us end to end, for instance). Needless to say, they do everything in RAM, sometimes even in FPGAs, and can’t use Java because garbage-collection can take up to 30ms at random times. Throughput is not the only metric when analyzing the performance of a transaction-processing system.

  5. It’s all about locality. The disk subsystem likes it, because 1) an update of a single row and an update of an 8K disk page cost the same disk I/O and in many cases the same CPU time, and 2) an extent of 200 consecutive 8K pages is fetched or written in 15-18ms, the same time that is needed for two disk pages in different locations. When these two effects multiply, a write of a dozen integers scattered here and there may cost as much as I/O of megabytes. Even if the active set is all in memory, changes cannot stay memory-only; they have to reach persistent storage at some reasonable rate.
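    (To put those figures into arithmetic, a back-of-the-envelope sketch using the numbers above: roughly 16ms buys either 200 consecutive 8K pages or two scattered ones.)

    ```python
    # Back-of-the-envelope locality arithmetic using the figures above:
    # ~15-18ms per head stroke, 8K pages, 200-page extents.
    SEEK_MS = 16.0            # mid-range of the 15-18ms figure
    PAGE_KB = 8

    sequential_kb = 200 * PAGE_KB   # one stroke, 200 consecutive pages: 1600 KB
    scattered_kb = 2 * PAGE_KB      # one stroke, two pages far apart: 16 KB

    print("sequential: %d KB per %.0f ms" % (sequential_kb, SEEK_MS))
    print("scattered:  %d KB per %.0f ms" % (scattered_kb, SEEK_MS))
    print("ratio: %.0fx" % (sequential_kb / scattered_kb))
    # => 100x: scattering a dozen small updates around the disk can cost
    #    about as much as writing megabytes sequentially.
    ```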

    “As of August 11, 2009, the LUBM 8000 load speed was 160,739 triples-per-second on a single machine with 2 x Xeon 5520 and 72G RAM. Adding a second machine with 2 x Xeon 5410 and 16G RAM, and 1 x 1GigE interconnect, the load rate increased to 214,188 triples-per-second.” ( http://www.openlinksw.com/weblog/oerling/?id=1568 ) The software is Virtuoso 6 Cluster, set up with 8 partitions per host. That is 160,000 triples per second with four-way indexing of triples, plus additional two-way indexing of IRIs and IRI namespaces and additional two-way indexing of literals, so it works out to 650,000 to 700,000 index-level inserts per second. No doubt these updates were not random. They were carefully pipelined, then sorted, then pipelined again to get the best possible logical proximity of the data to change; then the affected pages were carefully sorted by the database to get the best disk throughput with a minimal number of head strokes; and even before these operations the engine did its best to forecast the nature of the activity and to leave some free room for data that might come later and should fit close to similar data inserted before.

    It’s all about locality, part II. CPU cache is much faster than RAM. That’s the main reason why the not-yet-released Virtuoso 7 outperforms Virtuoso 6 by roughly 10 times on some loads: it accesses the same or bigger amounts of data, but the data are more compact, so in reality the same CPU cache lines are accessed many times. No doubt that does not happen for all loads, and the worst case is as bad as before.

    Finally, the database is not a write-only memory, not a “mass grave”. Queries to be answered and DML statements to be executed compete with each other, because the total resources are limited. Again, it’s all about locality, part III.

    The conclusion is very simple. Don’t look at benchmarks, don’t look at abstract numbers; be ready to spend time and money polishing a specific application with some realistic data. And when the maximum throughput is achieved, the next problem may appear: how to make the throughput worse… in order to fight the latency.

  6. Ivan: Those 160,739 triples per second were not loaded in 160,739 transactions, were they? My understanding from Oracle is that bulk loading can be 50 times faster than sending in data one or two rows at a time with SQL inserts.
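    (For reference, here is the kind of comparison I have in mind, sketched against PostgreSQL with psycopg2 and a made-up table; Oracle’s SQL*Loader direct-path load plays roughly the same role as COPY here.)

    ```python
    # Sketch comparing row-at-a-time INSERTs with a bulk COPY in PostgreSQL.
    # Assumes psycopg2 and a hypothetical two-column "triples_like" table;
    # the point is only to show the two code paths.
    import io
    import time

    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # 1. one INSERT per row, committed at the end
    rows = [(i, "value-%d" % i) for i in range(100000)]
    t0 = time.time()
    cur.executemany("INSERT INTO triples_like (id, val) VALUES (%s, %s)", rows)
    conn.commit()
    print("row-at-a-time: %.1f s" % (time.time() - t0))

    # 2. bulk load of a second batch via COPY from an in-memory file
    rows = [(i + 100000, "value-%d" % i) for i in range(100000)]
    t0 = time.time()
    buf = io.StringIO("".join("%d\t%s\n" % r for r in rows))
    cur.copy_from(buf, "triples_like", columns=("id", "val"))
    conn.commit()
    print("COPY:          %.1f s" % (time.time() - t0))
    ```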

  7. Philip, of course that was a sort of carefully polished bulk load. In the case of transactions, the behavior would depend heavily on the OS. In many systems, entering and leaving a mutex costs almost nothing, unlike the long thread switch that happens on an attempt to enter a mutex that is already held. The main headache of transactional mode is not re-fetching some extra data to ensure integrity (roots of trees are cheap) but the statistically high risk of switching a thread on a mutex.

    So in many SMP installations the number of transactions does not matter; the locality does matter (again), but this time, when the locality is too good, it can add a huge penalty even if transactions never touch the same rows at the same time, so they never wait for a specific row and never roll back.
    I should ask Orri Erling about the exact numbers; he knows them much better. (I’ve tried to call him, but it’s too late today in his timezone.) My guess is that the bulk, non-transactional 160K triples per second would become 20K or 80K or 100K transactions per second on the same hardware, depending on the mode of writing the transaction log (wait for the write of each transaction, wait for the transfer of the data to the device, or fire to the OS and forget), and a good RAID controller would eliminate the logging bottleneck. But that is a carefully optimized case, and a real application can be terribly slow.
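    (For PostgreSQL specifically, the spectrum just described, from waiting for every log write down to not waiting at all, looks roughly like the settings below; this is a sketch, not a statement about how Virtuoso does it. Other engines have their own knobs, e.g. InnoDB’s innodb_flush_log_at_trx_commit.)

    ```python
    # Rough mapping of commit/log-flush behavior onto PostgreSQL settings
    # (per-session where possible).
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    # 1. wait for the log flush of each transaction (the durable default)
    cur.execute("SET synchronous_commit TO on")

    # 2. don't wait: COMMIT returns before the WAL record is flushed; a crash
    #    can lose the last fraction of a second of commits, but the database
    #    stays consistent.
    cur.execute("SET synchronous_commit TO off")

    # 3. skipping even the forced flush to the device (fsync = off) is a
    #    postgresql.conf-level setting and is only safe for throwaway data.
    ```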

    For a shared-nothing cluster, the transaction log is no longer a bottleneck; the limitations come from the number of transactions per second per client, because the commit of a transaction includes the network latency of the cluster plus some queues and buffers in node-to-node IPC. A sufficient number of clients can produce quite big transaction counts. What’s funny is that a single bulk load fails to feed a cluster with data: hungry nodes wait for the single feeder that reads from the dump file. It’s better to forget about locality (to a degree) and run some number of bulk loads in parallel, e.g., by cutting the dump into separate files and loading these files in parallel on different nodes.
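    (The “cut the dump into separate files and load them in parallel” advice, sketched with psycopg2 and made-up file and table names; one loader connection per file, so that no single feeder starves the nodes.)

    ```python
    # Sketch of loading pre-split, tab-separated dump files in parallel,
    # one connection per file. File and table names are hypothetical.
    from concurrent.futures import ThreadPoolExecutor

    import psycopg2

    FILES = ["dump_part_%02d.tsv" % i for i in range(8)]

    def load_one(path):
        conn = psycopg2.connect("dbname=test")   # separate connection per loader
        cur = conn.cursor()
        with open(path) as f:
            cur.copy_from(f, "triples_like", columns=("id", "val"))
        conn.commit()
        conn.close()
        return path

    with ThreadPoolExecutor(max_workers=len(FILES)) as pool:
        for done in pool.map(load_one, FILES):
            print("loaded", done)
    ```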

    If transactions are not generated by stored procedures but come from clients, be ready to spend huge amounts of time on round-trips, context switches, and compilation of incoming queries. IPC can easily cost more than everything else put together. We had a live specimen of a web server + application server with additional network activity + database that dispatched an enormous SMS traffic from web applications to mobile networks. It consumed almost 100% of the CPU of four big Sun boxes. When the same functionality was moved into a single executable, without significant changes in business logic, that executable ran on a single box consuming about 10% of its CPU. The rest of the hardware was reused for rotating banners, etc. As the service became more popular, those banners occupied almost all the CPU time of the boxes, but the CPU consumption of the main executable remained basically the same: about 10% of the CPU of one box. So relocating the application logic and the web server directly to the database server is not a bad idea.
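    (At a much smaller scale than merging everything into one executable, the same round-trip arithmetic is why stored procedures help: N statements sent from the client cost N network round-trips, while one procedure call costs one. A sketch in PL/pgSQL via psycopg2 follows; the function, tables, and columns are made up.)

    ```python
    # Sketch of collapsing several client round-trips into one stored-procedure
    # call (PL/pgSQL). Function, table, and column names are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()

    cur.execute("""
        CREATE OR REPLACE FUNCTION send_sms(p_user bigint, p_text text)
        RETURNS void AS $$
        BEGIN
            INSERT INTO outbox (user_id, body) VALUES (p_user, p_text);
            UPDATE quotas SET remaining = remaining - 1 WHERE user_id = p_user;
            INSERT INTO sms_log (user_id, action) VALUES (p_user, 'sms');
        END;
        $$ LANGUAGE plpgsql
    """)
    conn.commit()

    # Chatty version: three execute() calls, i.e. three round-trips per message.
    # Server-side version: a single round-trip per message.
    cur.execute("SELECT send_sms(%s, %s)", (42, "hello"))
    conn.commit()
    ```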

  8. Will this class be available for review on MIT OCW?

    While I cannot comment on the performance numbers, it would seem prudent to include a brief discussion of SAN architecture, Fibre Channel vs. NAS. The RDBMS systems you describe are, as we know, usually SAN-attached. That architecture will most likely affect the performance numbers of the RDBMS.

    JJD – Learnin’ from afar…

  9. JJ: Will it be on OCW? I hope not! The class will be three days of people in a room solving problems on their laptops while the teachers mill about and try to help. There will be some discussions among students and staff, but overall I would say that it would make for some pretty bad TV.

  10. Ahhh…that would not be as entertaining. I thought it had some preliminary lecture for the first few hours.
    Wish I was there to take the class. I am in the midst of attempting to build a DB that can assign switch ports in an enterprise-class data center.
    JJD – Back to Google….
