In-Memory Microbenchmark (Large)

Symas Corp., June 2014


Continuing on from before: the RocksDB performance report tested an in-memory workload consisting of a 512M record, ~60GB database on a server with 144GB of RAM. That seems like a ridiculous waste of a machine to me, so just to prove the point, we proceeded to test a workload with a 1000M (one billion) record, ~120GB database on our server with only 128GB of RAM.

This data set is too large to run on our machine's 64GB tmpfs, so we store the database on an XFS partition of our server's storage array. This HP DL585 G5 has an HP Smart Array P400 controller with 8 HP DG146BABCF 146GB 10Krpm SAS drives attached, running in RAID5. The controller has 208MB of read cache and no write cache.

The RocksDB test focuses on multithreaded read performance for a purely in-memory database. The data set size for this test is still smaller than the machine's RAM, but as you'll see, most of the DB engines are too bloated to operate on it in a purely in-memory mode.

Test Overview

Using this server, we generate a database with one billion records. The records have 16 byte keys and 100 byte values, so the resulting database should be about 116GB in size. After the data is loaded, a "readwhilewriting" test is run using 16 reader threads and one writer. All of the threads operate on randomly selected records in the database. The writer performs updates to existing records; no records are added or deleted, so the DB size should not change much during the test.
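
For illustration only, here is a rough sketch of what the load phase looks like, written against LMDB's C API. The actual tests used the modified db_bench drivers linked under "Files" below; the database path, batch size, and sequential key order here are arbitrary choices for the sketch, and error handling is omitted.

#include <lmdb.h>
#include <stdio.h>
#include <string.h>

#define NRECORDS 1000000000UL   /* one billion records */
#define KEYSIZE  16             /* 16 byte keys        */
#define VALSIZE  100            /* 100 byte values     */
#define BATCH    1000           /* records per write transaction (arbitrary) */

int load(const char *path)
{
    MDB_env *env;
    MDB_dbi dbi;
    MDB_txn *txn;
    MDB_val key, val;
    char kbuf[KEYSIZE + 1], vbuf[VALSIZE];

    memset(vbuf, 'x', sizeof(vbuf));    /* dummy 100-byte value */

    mdb_env_create(&env);
    /* reserve enough address space for a ~120GB database */
    mdb_env_set_mapsize(env, 200UL * 1024 * 1024 * 1024);
    mdb_env_open(env, path, 0, 0664);

    mdb_txn_begin(env, NULL, 0, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    for (unsigned long i = 0; i < NRECORDS; i++) {
        snprintf(kbuf, sizeof(kbuf), "%016lu", i);   /* 16-byte zero-padded key */
        key.mv_size = KEYSIZE;  key.mv_data = kbuf;
        val.mv_size = VALSIZE;  val.mv_data = vbuf;
        mdb_put(txn, dbi, &key, &val, 0);
        if ((i + 1) % BATCH == 0) {     /* commit in batches to bound txn size */
            mdb_txn_commit(txn);
            mdb_txn_begin(env, NULL, 0, &txn);
        }
    }
    mdb_txn_commit(txn);
    mdb_env_close(env);
    return 0;
}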

The pertinent results are tabulated here and expanded on in the following sections.
Engine     | Load Time (Wall / User / Sys)           | Overhead     | Load Size (KB) | Writes/Sec | Reads/Sec | Run Time (Wall / User / Sys)            | Final Size (KB) | CPU%  | Process Size (KB)
LevelDB    | 00:49:05.93 / 00:49:32.62 / 00:08:57.03 | 1.1913555312 | 113899644      | 8376       | 23809     | 03:00:37.00 / 05:38:40.47 / 00:50:20.16 | 123425380       | 215%  | 122142816
Basho      | 02:59:15.00 / 02:29:49.02 / 00:35:05.30 | 1.0315499768 | 137026844      | 13101      | 14158     | 03:01:08.00 / 05:33:58.93 / 01:32:31.55 | 128267064       | 235%  | 72605236
BerkeleyDB | 13:13:26.00 / 04:12:01.15 / 00:33:27.76 | 0.3598056968 | 204824588      | 15         | 674       | 03:12:09.00 / 00:08:37.58 / 00:07:58.23 | 204799644       | 8%    | 69855128
Hyper      | 01:13:44.00 / 01:00:02.31 / 00:09:58.77 | 0.9496112116 | 113944640      | 7526       | 37285     | 03:00:37.00 / 08:00:16.59 / 01:44:16.55 | 123031132       | 323%  | 123462360
LMDB       | 00:11:49.87 / 00:08:44.59 / 00:03:05.15 | 0.9998168679 | 125806528      | 25092      | 1822989   | 02:26:30.00 / 18:00:18.92 / 00:15:44.15 | 125849876       | 1493% | 125852820
RocksDBpfx | 02:39:36.00 / 02:57:52.19 / 00:06:27.98 | 1.1549885129 | 122080508      | 22966      | 1204      | 03:13:11.00 / 02:23:25.32 / 00:25:17.81 | 145562584       | 87%   | 128229904
TokuDB     | 01:43:13.00 / 01:35:19.27 / 00:20:16.91 | 1.1200032295 | 122792020      | 3          | 667       | 03:00:31.00 / 00:20:59.31 / 00:09:12.73 | 122627344       | 16%   | 75835612
WiredLSM   | 02:43:55.00 / 04:09:25.93 / 01:51:27.18 | 2.2016380275 | 120969092      | 24048      | 44189     | 03:20:30.00 / 07:33:12.19 / 02:18:10.51 | 152352148       | 294%  | 120303088
WiredBtree | 00:51:05.33 / 00:15:47.54 / 00:04:25.41 | 0.3956996473 | 119097476      | 74         | 1245      | 03:03:01.00 / 00:11:41.84 / 00:12:19.06 | 128724868       | 13%   | 94761320

Loading the DB

The stats for loading the DB are shown in this graph.

The "Wall" time is the total wall-clock time taken to run the loading process. Obviously shorter times are faster/better. The actual CPU time used is shown for both User mode and System mode. User mode represents time spent in actual application code; time spent in System mode shows operating system overhead where the OS must do something on behalf of the application, but not actual application work. In a pure RAM workload where no I/O occurs, ideally the computer should be spending 100% of its time in User mode, processing the actual work of the application. LMDB is effectively ideal.

The "Overhead" column is the ratio of the User and System time added together to the Wall time. It is measured against the right-side Y-axis on this graph. This shows how much of the work of the DB load occurred in background threads. Ideally this value should be 1: all foreground and no background work. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy. I.e., if the system is already busy doing work on behalf of users, there will not be any idle system resources available for background processing.
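
In other words, Overhead = (User + Sys) / Wall. For example, using the LevelDB row above: (00:49:32.62 + 00:08:57.03) / 00:49:05.93 ≈ (2972.6 s + 537.0 s) / 2945.9 s ≈ 1.19.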

Only LMDB has an Overhead of 1.0 - it requires no background processing to perform the data load. The other two Btree engines, BerkeleyDB and WiredTiger, have ratios far below 1 - they're stuck waiting for I/O. The overhead of the other engines is also partially masked by their respective I/O wait times.

This graph shows the load performance as throughput over time:

Due to the extremely long runtime of the BerkeleyDB load, a logarithmic scale was used for the Time axis of this graph. This scale actually helps reveal some important characteristics of these engines. Both LMDB and WiredTiger Btree show a stepwise decline in throughput. Since they are Btrees, every time the number of records grows past a certain threshold the tree height increases by one, which adds one more level of indirection to every record lookup. The steps where these thresholds occur are easily visible in the LMDB graph; they're visible to a lesser extent in the WiredTiger Btree graph. Note also that the WiredTiger Btree actually finished loading its data in about 19 minutes; the wall clock time stretched out to 51 minutes because the process was waiting to flush all its data to disk before exiting.
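
To see why the steps occur, note that a Btree with an average branch fanout of f and L records per leaf page holds on the order of L x f^(h-1) records at height h; each time the record count grows past that bound, the height - and therefore the number of page lookups per record - increases by one. The exact thresholds depend on page size and key/value sizes, which is why the steps fall at different points for the two engines.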

The graphs for the other engines show the main claim to fame for LSM designs - linear write speed on average. The actual throughput is still quite erratic, and all of them are several times slower than LMDB.

Run Time

The stats for running the actual readwhilewriting test are shown here.

The test duration would normally be controlled by how long it takes for the 16 reader threads to each read one billion records. For expedience, the test was capped at 3 hours per engine. Only LMDB was able to process all 16 billion reads in under 3 hours. At their last reported average read speed, it would have taken LevelDB 186 hours to complete the test - well over a week. Basho LevelDB would have taken 313 hours, BerkeleyDB 6594 hours - over 39 weeks. HyperLevelDB would have taken 119 hours, RocksDB would have taken 3691 hours, or about 22 weeks. TokuDB would have taken 6663 hours - even longer than BerkeleyDB. WiredTiger LSM would have taken 100 hours, and WiredTiger Btree would have taken 3569 hours.
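
These projections follow directly from dividing the total read workload by each engine's reported read rate: 16 reader threads x one billion reads each = 16 billion reads. For example, at BerkeleyDB's 674 reads/second, 16,000,000,000 / 674 ≈ 23,700,000 seconds ≈ 6,594 hours.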

The total User and System time is expected to be much larger than the Wall time, since a total of 17 threads are running (16 readers and 1 writer). Ideally, with the 16 reader threads running continuously, the total should be about 16x larger than the Wall time.

How close each DB reaches the ideal is shown in this graph:

In fact LMDB runs at 1599% for the majority of the test, but the reader threads finish at varying times, so eventually only a few reader threads remain, which lowers the CPU utilization correspondingly. The rest of the engines are all heavily I/O bound.

Performance

The actual throughput in operations per second is shown in this graph.

The left axis measures the Write throughput and the right axis measures the Read throughput. For this test the writers were allowed to run with no constraints, just to see what peak speeds they could achieve. Interestingly enough, none of the "write-optimized" engines could match LMDB's write speed, and none of them are within an order of magnitude of LMDB's read speed.

The read speed for LMDB is under-reported here; the sustained read speed for the majority of the test run is over 2 million reads/second before the earlier-finishing threads drop out. (This still doesn't match the 7 million reads/second the RocksDB folks report, but then again the Xeon E5-2660 processors they're using are at least 3 years newer technology than our Opteron 8354s, and at least 5 times faster. In light of the results we've seen here, I'd expect LMDB on their hardware to easily hit tens of millions of reads per second.)

Detailed graphs of the per-thread throughput of some of the engines are included below. Not all of them are graphed, because some of them were so slow that they didn't produce any incremental stats output at all during their 3 hour run. (The incremental stats were printed after every 1048576 completed operations, so this means they were unable to complete that many operations in 3 hours.)
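
For reference, 1,048,576 operations over the 3 hour (10,800 second) cap works out to an average of roughly 97 operations per second; any thread averaging less than that never appears in the incremental output at all.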


LevelDB: All of the reader threads were represented in the incremental stats output, but only thread #3 made any progress. Its throughput basically mirrors the writer's throughput, so apparently its I/O requests were piggybacking on the writer's.


Basho: Only two of the reader threads produced any incremental stats. From the moment they first register, they appear to piggyback on the writer's I/O requests.


HyperLevelDB: All of the reader threads were represented in the output, but only thread #3 made any substantial progress. After its initial spike it piggybacked with the writer.


LMDB: All of the threads were quite busy. While most of the readers' throughput is quite erratic, there are several threads with basically constant throughput in the graphs. Also, at the tail end of the graph, the effect of the various readers finishing early is clearly seen, as the remaining threads jump up in throughput. This is also obviously where the CPU utilization numbers drop from 1599%.


RocksDB: Only the writer thread makes any progress here; the readers never showed up.


WiredTiger LSM: All of the threads reported at least one incremental result, but only 3 of the readers made any significant progress.

The recurring pattern of threads #3 and #8 making progress along with the writer implies a systematic bias in the db_bench software. Perhaps the sequence of keys from the random number generator was falling into sync between these threads.

The LSM-based designs are clearly able to reach much higher peak write speeds than LMDB. However, none of these peaks can be sustained for more than a brief moment, and thus their overall throughput is still slower. LMDB delivers continuous consistent throughput for both reads and writes.

Context Switches

One of the key design elements in the read optimization for LMDB is that readers make no blocking calls whatsoever. Many other systems claim to be "lock-free" or "wait-free" but these are all built upon atomic primitives that (even if only briefly) require exclusive access to a synchronization structure before progress can be made. None of these lock-free or wait-free approaches can scale well across multi-socket multiprocessor machines; they all result in multiple CPUs competing for simultaneous access to a single structure in memory.
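
As an illustration of what this means in practice, here is a minimal sketch of a reader loop against LMDB's public C API (this is not the benchmark driver, just the pattern). The first read-only transaction registers the thread's reader slot; after that, mdb_txn_reset() and mdb_txn_renew() reuse the slot, so the steady-state loop makes no blocking calls at all.

#include <lmdb.h>
#include <string.h>

void read_loop(MDB_env *env, MDB_dbi dbi, const char **keys, size_t nkeys)
{
    MDB_txn *txn;
    MDB_val key, data;

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);  /* registers this thread's reader slot */
    for (size_t i = 0; i < nkeys; i++) {
        key.mv_size = strlen(keys[i]);
        key.mv_data = (void *)keys[i];
        mdb_get(txn, dbi, &key, &data);   /* on success, data.mv_data points directly
                                             into the memory map - no copy is made */
        mdb_txn_reset(txn);   /* drop the snapshot ...                    */
        mdb_txn_renew(txn);   /* ... and take a fresh one; no locks taken */
    }
    mdb_txn_abort(txn);       /* release the slot and free the handle */
}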

The effectiveness of LMDB's lockless reader design is already evident in the results shown above. Further evidence and explanation can be gleaned from another part of the raw test data - the number of context switches performed by each benchmark process. The results are shown as both Voluntary and Involuntary context switches. Voluntary context switches occur when a program calls a function that can block - system calls, mutexes and other synchronization primitives, etc. Involuntary context switches occur when, for example, a CPU must handle an interrupt, or when the running thread's time slice has been fully consumed.
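
For reference, these counters can be read for any process via getrusage(2), which reports voluntary switches as ru_nvcsw and involuntary switches as ru_nivcsw. The benchmark figures below came from the raw test output, not from this snippet:

#include <sys/resource.h>
#include <stdio.h>

void print_context_switches(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) == 0)
        printf("voluntary: %ld  involuntary: %ld\n", ru.ru_nvcsw, ru.ru_nivcsw);
}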

This table shows the number of context switches, taken from the 20M record test. (That test was chosen because it's the only one where every engine ran to completion, thus all of the runs executed the same number of DB read operations.)
Engine | Loading (Voluntary / Involuntary) | Running (Voluntary / Involuntary)
LevelDB1064255851164550551355109
Basho903713967192848625
BerkeleyDB81362334346315635750
Hyper1464661321092068324644
LMDB8114761424227
RocksDB322845094800997407085
TokuDB5105341768634719826549808
WiredLSM18420830093115320245824
WiredBtree190188240593247020

Here's the same data from the 1000M test.
Engine | Loading (Voluntary / Involuntary) | Running (Voluntary / Involuntary)
LevelDB36860776728112807198665694
Basho179127283378899867681107827
BerkeleyDB20269926134482675178715828
Hyper391097512737206860119953399
LMDB284641323443734758
RocksDB33637127633565244239954
TokuDB1002703721917315419310858
WiredLSM170367168619964485594419818800
WiredBtree7208929701435853242041

These tables show one of the main reasons LMDB is so much faster than every other DB engine. All of the other engines have several orders of magnitude more context switches overall, and the vast majority of them are voluntary - that means blocking system calls and locking operations. In contrast, LMDB has practically none; a detailed profile would show that the few voluntary switches in these runs are mostly due to the db_bench framework, and not LMDB itself.

Instead, LMDB grabs hold of the CPU, uses it to the utmost, and doesn't let go until the OS forces it. This is one of the key bits of minutiae often overlooked in software designs, and where most systems fall down - they may have efficient-seeming algorithms on paper, but their implementations are riddled with inefficiencies. LMDB is lean and mean; there's no bloat and nothing getting in the way between your application and your data.

Space Used

Finally, the space used by each engine is illustrated in this graph.

The Load Size shows the amount of space in use at the end of the loading process. The Final Size shows the amount used at the end of the test run. Ideally the DB should only be 116GB since that is the total size of the one billion records. Also, since there were no add or delete operations, ideally the Final Size should be the same as the Load Size.

The Process Size shows the maximum size the test program grew to while running the test. In this case, many of the processes were bumping into the maximum memory limit of the machine. It's worth noting that Basho LevelDB, BerkeleyDB, TokuDB, and WiredTiger Btree were the only engines that actually seemed to honor their cache_size setting. As such, they were configured with 64GB caches for this run. LevelDB, HyperLevelDB, RocksDB, and WiredTiger LSM appear to just expand to as much space as is available; they were left with the 16GB cache setting used in previous runs. RocksDB caused the system to start swapping during its run, and so the test had to be rerun with system swap disabled.

Conclusion

In-memory databases have become pretty popular recently. Even Oracle has jumped into the game. But there's more to this game than just "store your DB on tmpfs and call it good." A database's job is to store data safely and retrieve data efficiently. Storing to tmpfs doesn't qualify as safe, and burning up "background" CPU cycles doesn't qualify as efficient. If your "storage" isn't persistent, it's not storage at all, it's just a cache.

The definition of "what fits in memory" will also vary wildly depending on whose DB you're talking about. Everyone brags about their lightweight footprint but only one of these stories is true; only one DB engine actually lets you use the maximum proportion of your available system resources on your actual work. If you're looking to jump on the In-Memory bandwagon, there's only one viable choice.

With LMDB you get your work done with no added overhead. LMDB stores just the data you asked it to store, with no logging or other cruft, so you get the most use out of your available RAM and disk space. LMDB uses the minimum amount of CPU to store and retrieve your data, leaving the rest for your applications to actually get work done. (And leaving more power in your battery, on mobile devices.) No other DB engine comes anywhere close.

Files

The files used to perform these tests are all available for download. Command script (1000M), raw output (1000M), binaries. The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.

The software versions we used:

violino:/home/software/leveldb> g++ --version
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

violino:/home/software/leveldb> git log -1 --pretty=format:"%H %ci" master
e353fbc7ea81f12a5694991b708f8f45343594b1 2014-05-01 13:44:03 -0700

violino:/home/software/basho_leveldb> git log -1 --pretty=format:"%H %ci" develop
16b22c8198975b62a938dff9910f4432772d253a 2014-06-06 12:25:40 -0400

violino:/home/software/db-5.3.21> ls -l README 
-rw-r--r-- 1 hyc hyc 234 May 11  2012 README

violino:/home/software/HyperLevelDB> git log -1 --pretty=format:"%H %ci" releases/1.0
a7a707e303ec1953d08cbc586312ac7b2988eebb 2014-02-10 09:43:03 -0500

violino:~/OD/mdb> git log -1 --pretty=format:"%H %ci" 
a93810cc3d1a062bf5edbe9c14795d0360cda8a4 2014-05-30 23:39:44 -0700

violino:/home/software/rocksdb> git log -1 --pretty=format:"%H %ci"
0365eaf12e9e896ea5902fb3bf3db5e6da275d2e 2014-06-06 18:27:44 -0700

violino:/home/software/ft-index> git log -1 --pretty=format:"%H %ci" master
f51c7180db1eafdd9e6efb915c396d894c2d0ab1 2014-05-30 12:58:28 -0400

violino:/home/software/wiredtiger> git log -1 --pretty=format:"%H %ci"
91da74e5946c409b8e05c53927a7b447129a6933 2014-05-21 17:05:08 +1000

All of the engines were built with compression disabled; compression was not used in the RocksDB test either. Some of these engines recommend/require use of a non-standard malloc library like Google tcmalloc or jemalloc. To ensure as uniform a test as possible, all of the engines in this test were built to use the standard libc malloc.