In-Memory Microbenchmark

Symas Corp., June 2014


This RocksDB performance report prompted us to investigate how other embedded storage engines would fare in a similar workload. For this test we used LMDB, BerkeleyDB, Google LevelDB and 3 of its derivatives - Basho LevelDB, HyperLevelDB, and Facebook's RocksDB, as well as TokuDB and WiredTiger.

This selection of engines is interesting because they're all closely related in one way or another: obviously the LevelDB databases are all of a breed, all using Log Structured Merge (LSM) trees as their main data store. LMDB and BerkeleyDB are both Btree engines. WiredTiger comes from the people who created BerkeleyDB, and it offers both LSM and Btree engines. Both LMDB and TokuDB's APIs are based on the BerkeleyDB API. TokuDB is unique here, being the only engine implementing Fractal Trees. (And since Fractal Trees are patented, they will probably remain the sole implementation for some time to come...)

The RocksDB test focuses on multithreaded read performance for a purely in-memory database. As such, none of the tests shown here involve any disk I/O and the data sets are chosen to ensure they fit entirely in the test machine's RAM.

Tests were conducted on two different test environments: one an Asus NP56D laptop with 16GB of RAM and a quad-core AMD A10-4600M APU, the other an HP DL585 G5 server with 128GB of RAM and 4 quad-core AMD Opteron 8354 CPUs.

Note: as with the microbench reports we published earlier, we did not design the test scenarios. The LevelDB authors designed the original microbench scenario, and the RocksDB authors designed this one. We have received criticism for running tests that are artificially biased to show LMDB in its best light. Such statements are nonsense; we didn't design the tests. If you have issues with how the tests were designed, take it up with the LevelDB or RocksDB authors, respectively.

1. Footprint

One of the primary reasons to use an embedded database is because one needs something lightweight with a small application footprint. Here's how the programs in this test stack up, using identical driver code and with their DB libraries statically linked into the binaries to show the full DB code size. (Note that this still fails to take into account other system libraries that some engines use that others don't. E.g., Basho and RocksDB also require librt, the realtime support library.) We've also listed each project's size in lines of source code, as reported by Ohloh.net.

size db_bench*
   text     data    bss      dec     hex  filename             Lines of Code
 285306     1516    352   287174   461c6  db_bench                     39758
 384206     9304   3488   396998   60ec6  db_bench_basho               26577
1688853     2416    312  1691581  19cfbd  db_bench_bdb               1746106
 315491     1596    360   317447   4d807  db_bench_hyper               21498
 121412     1644    320   123376   1e1f0  db_bench_mdb                  7955
1014534     2912   6688  1024134   fa086  db_bench_rocksdb             81169
 992334     3720  30352  1026406   fa966  db_bench_tokudb             227698
 853216     2100   1920   857236   d1494  db_bench_wiredtiger          91410

LMDB is still the smallest by far.

2. Small Data Set

Using the laptop we generate a database with 20 million records. The records have 16 byte keys and 100 byte values so the resulting database should be about 2.2GB in size. After the data is loaded a "readwhilewriting" test is run using 4 reader threads and one writer. All of the threads operate on randomly selected records in the database. The writer performs updates to existing records; no records are added or deleted so the DB size should not change much during the test.
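
As a quick sanity check on that figure: 20,000,000 records x (16-byte key + 100-byte value) = 2,320,000,000 bytes, or roughly 2.2GB before any per-record overhead the engines themselves add.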

The tests in this section and in Section 3 are all run on a tmpfs, just like the RocksDB report. I.e., all of the data is stored only in RAM. Additional tests using an SSD follow in Section 4.

The pertinent results are tabulated here and expanded on in the following sections.
Engine | Load Time (Wall / User / Sys) | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Time (Wall / User / Sys) | Final Size (KB) | CPU% | Process Size (KB)
LevelDB | 00:34.70 / 00:44.72 / 00:06.70 | 1.4818443804 | 2246004 | 10232 | 26678 | 00:49:58.73 / 01:31:48.62 / 00:52:50.95 | 3452388 | 289% | 2138508
Basho | 00:40.41 / 01:24.39 / 00:17.82 | 2.5293244246 | 2368768 | 10232 | 68418 | 00:19:32.94 / 01:14:10.04 / 00:01:19.19 | 2612436 | 386% | 6775376
BerkeleyDB | 02:12.61 / 01:58.92 / 00:13.57 | 0.9990950909 | 5844376 | 9565 | 86202 | 00:15:28.44 / 00:42:07.97 / 00:17:27.49 | 5839912 | 385% | 3040716
Hyper | 00:38.78 / 00:49.88 / 00:06.43 | 1.4520371325 | 2246448 | 10208 | 138393 | 00:09:38.39 / 00:35:06.12 / 00:02:06.18 | 2292632 | 385% | 2700088
LMDB | 00:10.55 / 00:08.15 / 00:02.37 | 0.9971563981 | 2516192 | 10224 | 1449709 | 00:00:55.46 / 00:03:37.63 / 00:00:01.67 | 2547968 | 395% | 2550408
RocksDB | 00:21.54 / 00:34.70 / 00:05.99 | 1.8890436397 | 2256032 | 10233 | 91544 | 00:14:37.74 / 00:54:06.84 / 00:02:38.04 | 3181764 | 387% | 6713852
TokuDB | 01:45.12 / 01:41.58 / 00:47.37 | 1.4169520548 | 2726168 | 9881 | 109682 | 00:12:12.91 / 00:37:41.45 / 00:07:10.03 | 3920784 | 367% | 5429056
WiredLSM | 01:10.93 / 02:35.55 / 00:18.62 | 2.4555195263 | 2492440 | 10230 | 179617 | 00:07:26.24 / 00:28:55.85 / 00:00:07.76 | 2948988 | 390% | 3205396
WiredBtree | 00:17.79 / 00:15.68 / 00:02.09 | 0.9988757729 | 2381876 | 10021 | 752078 | 00:01:53.46 / 00:06:36.98 / 00:00:14.78 | 4752568 | 362% | 3415468

Loading the DB

The stats for loading the DB are shown in this graph.

The "Wall" time is the total wall-clock time taken to run the loading process. Obviously shorter times are faster/better. The actual CPU time used is shown for both User mode and System mode. User mode represents time spent in actual application code; time spent in System mode shows operating system overhead where the OS must do something on behalf of the application, but not actual application work. In a pure RAM workload such as this, where no I/O occurs, ideally the computer should be spending 100% of its time in User mode, processing the actual work of the application. Both LMDB and WiredTiger Btree are close to this ideal.

The "Overhead" column is computed by adding the User and System time together and dividing by the Wall time; it is measured against the right-side Y-axis on this graph. It shows how much of the work of the DB load occurred in background threads. Ideally this value should be 1: all foreground work and no background work. When a DB engine relies heavily on background processing to achieve its throughput, it will bog down more noticeably when the system gets busy. I.e., if the system is already busy doing work on behalf of users, there will not be any idle system resources available for background processing.
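
For example, using the Load Time columns in the table above: LevelDB used 44.72s of User time and 6.70s of System time against 34.70s of Wall time, for an Overhead of (44.72 + 6.70) / 34.70 ≈ 1.48, while LMDB's (8.15 + 2.37) / 10.55 ≈ 1.00 means essentially all of its work happened in the foreground.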

Here the 3 Btree engines all have an Overhead of 1.0 - they require no background processing to perform the data load. In contrast, all of the LSM engines require significant amounts of processing to perform ongoing compaction of their data.

This graph shows the load performance as throughput over time:

It makes the difference in performance between the DB engines much more obvious. BerkeleyDB clearly adheres to the "slow and steady" principle; its throughput is basically constant. Basho shows wildly erratic throughput. The others are all fairly consistent at this small data volume.

Run Time

The stats for running the actual readwhilewriting test are shown here.

The test duration is controlled by how long it takes for the 4 reader threads to each read 20 million records. The total User and System time is expected to be much larger than the Wall time since a total of 5 threads are running (4 readers and 1 writer). Ideally the total should be exactly 4x larger than the Wall time. How close each DB reaches the ideal is shown in this graph:
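
For example, LMDB's run used 3:37.63 of User time and 1.67s of System time over a 55.46s Wall time, about 3.95x - very near the ideal 4x - which is also what its 395% CPU figure in the table reflects. LevelDB's ratio works out to only about 2.89x.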

Google LevelDB shows the worst scaling; it isn't even able to make full use of 3 CPU cores.

Performance

The actual throughput in operations per second is shown in this graph.

The left axis measures the Write throughput and the right axis measures the Read throughput. The writers were constrained to no more than 10240 writes per second, as in the RocksDB report. (This humble little laptop could not sustain 81920 writes per second.) The graph shows that BerkeleyDB, TokuDB, and WiredTiger Btree were unable to attain this write speed, let alone exceed it.

The WiredTiger Btree delivers impressive read throughput, earning it a solid second place in the results. None of the other engines are even within an order of magnitude of LMDB's read performance. Graphs with a detailed breakdown of the per-thread throughput are available on the Details page.

Space Used

Finally, the space used by each engine is illustrated in this graph.

The Load Size shows the amount of space in use at the end of the loading process. The Final Size shows the amount used at the end of the test run. Ideally the DB should only be 2.2GB since that is the total size of the 20 million records. Also, since there were no add or delete operations, ideally the Final Size should be the same as the Load Size.
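
To put numbers on it: BerkeleyDB's 5844376KB Load Size is roughly 2.6 times the 2.2GB of raw record data, and while LevelDB loads at close to the raw data size, its 3452388KB Final Size shows it growing by more than 50% over the course of the run. LMDB's Load Size and Final Size differ by barely 1%.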

Most of the DB engines (except LMDB) have significant space overhead, and unfortunately this graph doesn't even capture the full scope of this overhead - during the run various log files will be growing and being truncated, so the actual space used may be even larger than shown here.

LMDB uses no log files - the space reported here is all the space that it uses.

The Process Size shows the maximum size the test program grew to while running the test. This is another major concern when trying to determine how much system capacity is needed to support a given workload. In this case, all of the engines (except LMDB) require memory for application caching. They were all set to run with 6GB of cache. Both Basho and RocksDB would have used more memory if available; in an earlier run with the cache set to 8GB they both grew past 9GB and caused the machine to start swapping. While most people believe "more cache is better" the fact is that trying to use too much will hurt performance. As always, it takes careful testing and observation to choose a workable cache size for a given workload.

It's not clear to me why any DB engine would need more than 6GB of memory to manage 2GB of actual data. With LMDB's Single-Level-Store, cache size is a non-issue and the engine can never drive a system into swapping. There's no wasted overhead - all of the memory in the system gets applied to your actual application, so you can get more work done on any given hardware configuration than with any other database engine.
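
To make the Single-Level-Store point concrete, here is a minimal sketch (not the actual db_bench_mdb driver) of an LMDB read path. The only sizing knob is the map size, which is just an upper bound on how large the database may grow rather than a cache allocation, and a successful mdb_get() hands back a pointer directly into the memory map instead of copying the value into an application buffer. The ./testdb path and key value are illustrative; error handling is abbreviated.

#include <stdio.h>
#include <lmdb.h>

int main(void)
{
    MDB_env *env;
    MDB_txn *txn;
    MDB_dbi dbi;
    MDB_val key, data;
    int rc;

    mdb_env_create(&env);
    /* The map size is only an upper bound on how large the DB may grow;
     * it is not a cache and does not allocate memory up front. */
    mdb_env_set_mapsize(env, (size_t)16 << 30);
    rc = mdb_env_open(env, "./testdb", MDB_RDONLY, 0664);
    if (rc) {
        fprintf(stderr, "mdb_env_open: %s\n", mdb_strerror(rc));
        return 1;
    }

    mdb_txn_begin(env, NULL, MDB_RDONLY, &txn);
    mdb_dbi_open(txn, NULL, 0, &dbi);

    key.mv_size = 16;                   /* 16-byte keys, as in the benchmark */
    key.mv_data = "0000000000000001";
    rc = mdb_get(txn, dbi, &key, &data);
    if (rc == 0)
        /* data.mv_data points into the OS page cache via the mmap;
         * there is no application-level cache and no extra copy. */
        printf("found a %zu-byte value\n", data.mv_size);

    mdb_txn_abort(txn);
    mdb_env_close(env);
    return 0;
}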

3. Larger Data Set

These tests use 100 million records and are run on the 16 core server. Aside from the data set size things are much the same. Here are the tabular results:
Engine | Load Time (Wall / User / Sys) | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Time (Wall / User / Sys) | Final Size (KB) | CPU% | Process Size (KB)
LevelDB | 03:06.75 / 04:41.26 / 00:42.87 | 1.7356358768 | 11273396 | 9184 | 7594 | 01:00:02.00 / 01:22:11.46 / 01:52:10.46 | 13734168 | 323% | 3284192
Basho | 04:22.96 / 11:09.24 / 02:18.93 | 3.0733571646 | 11449492 | 10211 | 80135 | 01:00:23.00 / 14:32:23.67 / 00:11:49.40 | 13841220 | 1464% | 19257796
BerkeleyDB | 14:59.45 / 13:34.30 / 01:25.15 | 1 | 28381956 | 3378 | 55066 | 01:00:02.00 / 03:02:00.69 / 12:42:39.63 | 28387880 | 1573% | 14756768
Hyper | 03:43.61 / 05:41.14 / 00:39.02 | 1.7001028577 | 11280092 | 10231 | 11673 | 01:00:04.00 / 01:59:42.09 / 01:53:24.27 | 15149416 | 387% | 6332460
LMDB | 01:04.15 / 00:52.31 / 00:11.82 | 0.9996882307 | 12605332 | 10230 | 2486800 | 00:11:14.14 / 02:47:58.57 / 00:00:10.06 | 12627692 | 1598% | 12605788
RocksDB | 02:28.66 / 03:59.92 / 00:30.97 | 1.8222117584 | 11289688 | 10232 | 129397 | 01:00:22.00 / 12:08:05.94 / 02:51:58.54 | 12777708 | 1490% | 18599544
TokuDB | 07:44.10 / 09:17.31 / 02:54.82 | 1.5775263952 | 12665136 | 4601 | 70208 | 01:00:15.00 / 03:02:37.44 / 11:21:45.00 | 15328956 | 1434% | 23315964
WiredLSM | 07:10.50 / 19:25.80 / 02:31.10 | 3.0590011614 | 12254620 | 10194 | 278415 | 01:00:05.00 / 15:51:04.17 / 00:02:09.76 | 16016296 | 1586% | 17723992
WiredBtree | 02:07.49 / 01:49.52 / 00:17.97 | 1 | 11932620 | 10145 | 1320939 | 00:20:58.10 / 05:06:13.60 / 00:05:14.87 | 23865368 | 1560% | 20743232

Loading the DB

The stats for loading the DB are shown in this graph.

The overall trends are about the same as for the test with 20M records.

This graph shows the load performance as throughput over time:

It's a bit more revealing than the 20M test. BerkeleyDB continues to deliver its rock-steady throughput. LevelDB and Basho show the infamous negative spikes in throughput caused by periodic compaction, although Basho's later performance seems even more pathological than usual. RocksDB shows linearly decaying throughput with data volume. LevelDB and HyperLevelDB show asymptotically decaying throughput.

Run Time

The stats for running the actual readwhilewriting test are shown here.

This time the test duration was capped at 1 hour, simply for the sake of expedience. Only LMDB and the WiredTiger Btree were actually able to process all 100 million records in under 1 hour. As before, the total User and System time is expected to be much larger than the Wall time since a total of 17 threads are running (16 readers and 1 writer). Ideally the total should be exactly 16x larger than the Wall time. How close each DB reaches the ideal is shown in this graph:

Both Google LevelDB and HyperLevelDB are unable to scale beyond 4 cores. They have major lock contention issues. Basho, RocksDB, and TokuDB have locking issues as well, though to a much lesser degree.

(Note: in the raw output you'll see that we also ran LMDB and WiredTiger Btree for an hour, like all the others. Just to show that nothing self-destructs over time. LMDB can handle whatever workload you throw at it, non-stop.)

Performance

The actual throughput in operations per second is shown in this graph.

The left axis measures the Write throughput and the right axis measures the Read throughput. The writers were constrained to no more than 10240 writes per second, as in the RocksDB report. LevelDB, BerkeleyDB, and TokuDB are unable to achieve this write rate. The WiredTiger Btree engine again gets a solid 2nd place for read rate. See the Details page for a detailed analysis of each engine's performance in this test.

Space Used

Finally, the space used by each engine is illustrated in this graph.

In this test a 16GB cache was configured, and the DB itself should have been only 11GB. It's important to note that for all of the engines besides LMDB, the total runtime footprint is the sum of the Final Size and the Process Size. I.e., when the database is stored on disk, the OS also caches a copy of every accessed page, and the DB engine makes whatever copies it needs in its own internal cache. But for LMDB, the total runtime footprint is just the Process Size, since it is using the OS cache directly and not making redundant copies of anything.
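
Using the table above as a worked example: RocksDB's runtime footprint is roughly its 12777708KB Final Size plus its 18599544KB Process Size, about 30GB, to serve an 11GB data set, whereas LMDB's footprint is just its 12605788KB Process Size, about 12GB.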

The significance of a Single-Level-Store design cannot be overstated - when working with an in-memory workload, every redundant byte means one less byte for useful data. One may easily find that workloads that are too large to operate in-memory with other DB engines fit smoothly into RAM using LMDB. And moreover, LMDB delivers in-memory performance without requiring the use of volatile storage (like tmpfs), so there's no need to worry about migrating to a different storage engine as the data sizes grow.

The space used is a major concern even in this testing environment. The RocksDB test used a server with 144GB of RAM for 500 million records, which should have consumed about 60GB. With runtime overheads, they ended up at around 75GB total. On our server with 128GB of RAM, the tmpfs will only hold 64GB. It's just barely large enough for LMDB to duplicate the 500M record test, but none of the other DB engines will fit. This is unfortunate, because we can't get a directly comparable (500 million record) result on the current setup. But our 64 core server with 512GB of RAM should be getting freed of VM hosting duty soon, so we'll be running some additional tests on that box in the near future.

Other Notes

Both Basho and RocksDB attempt to open a huge number of files at once. This required running the tests as superuser in order to raise the open-files ulimit (RLIMIT_NOFILE) high enough for them to complete. Also, we discovered a bug in Basho that caused the tests to hang in 4 out of 5 tries.
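
For reference, here is a minimal sketch of raising that limit programmatically with setrlimit(); lifting the hard limit is what requires root (or CAP_SYS_RESOURCE), which is why these runs were done as superuser. The 100000 figure is purely illustrative, not the number any particular engine needs.

#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft and (if necessary) hard limit on open file descriptors. */
static int raise_nofile(rlim_t wanted)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    if (rl.rlim_max < wanted)
        rl.rlim_max = wanted;   /* raising the hard limit needs privilege */
    rl.rlim_cur = wanted;       /* the soft limit is what open() enforces */
    return setrlimit(RLIMIT_NOFILE, &rl);
}

int main(void)
{
    if (raise_nofile(100000) != 0)
        perror("setrlimit(RLIMIT_NOFILE)");
    return 0;
}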

In the original RocksDB test, they ran RocksDB with a WriteAhead Log enabled, to persist the data onto a disk while operating on tmpfs. We were unable to run the test this way because of a bug in the current RocksDB code. With LMDB such measures are unnecessary anyway, since LMDB can operate on top of a regular filesystem instead of requiring tmpfs.

4. Small Set on Disk

Since using tmpfs basically eats 50% of the RAM in the system, it puts a severe constraint on how large a data set can be managed. It is also an unrealistic way to use a database, since none of the data is actually persisted to real storage. Given how much time it takes some of these engines to load the databases, one would not want to have to reload the full contents before every run. As such, we also decided to test on a regular filesystem. We are still using a data set smaller than the size of RAM, but since we're not sacrificing 50% of RAM to tmpfs, the data set can be larger than in the prior tests. On the laptop we use 50M records, so around 6GB of data, stored on a 512GB Samsung 830 SSD with an ext4 partition.

The actual drive characteristics should not matter because the test datasets still fit entirely in RAM and are all using asynchronous writes. I.e., there should still not be any I/O occurring, and no need for the test programs to wait for any writes to complete. Here are the tabular results:

Engine | Load Time (Wall / User / Sys) | Overhead | Load Size (KB) | Writes/Sec | Reads/Sec | Run Time (Wall / User / Sys) | Final Size (KB) | CPU% | Process Size (KB)
LevelDB | 00:01:57.22 / 00:02:17.64 / 00:00:35.51 | 1.4771370073 | 5974496 | 10228 | 151800 | 00:23:11.69 / 01:20:31.83 / 00:07:01.87 | 6226672 | 377% | 9213436
Basho | 00:02:38.02 / 00:04:15.20 / 00:01:25.90 | 2.1585875206 | 6017488 | 10229 | 8773 | 02:02:03.00 / 05:12:44.05 / 02:12:08.07 | 8465664 | 364% | 8234040
BerkeleyDB | 00:08:46.95 / 00:05:26.20 / 00:01:10.82 | 0.7534301167 | 13700356 | 7479 | 70443 | 00:49:44.61 / 01:57:17.31 / 00:51:02.52 | 13732924 | 338% | 6626740
Hyper | 00:02:07.81 / 00:02:32.24 / 00:00:28.51 | 1.4142085909 | 5966828 | 10225 | 194245 | 00:18:04.71 / 01:08:37.70 / 00:02:12.70 | 7997988 | 391% | 8269804
LMDB | 00:00:31.79 / 00:00:22.14 / 00:00:09.59 | 0.998112614 | 6595848 | 10234 | 1374886 | 00:02:33.24 / 00:09:59.38 / 00:00:07.25 | 6627556 | 395% | 6630132
RocksDB | 00:00:38.44 / 00:00:41.59 / 00:00:24.19 | 1.7112382934 | 6395296 | 10230 | 127147 | 00:27:30.99 / 01:45:55.57 / 00:03:01.70 | 7076928 | 395% | 10040740
RocksDBpfx | 00:05:24.55 / 00:05:50.76 / 00:00:24.54 | 1.1563703594 | 6398448 | 10233 | 424119 | 00:08:17.61 / 00:26:57.26 / 00:05:14.13 | 6875504 | 388% | 9426456
TokuDB | 00:04:35.62 / 00:04:19.69 / 00:01:59.10 | 1.3743197156 | 8051428 | 174 | 43529 | 01:20:20.00 / 04:17:36.94 / 00:46:30.35 | 7016752 | 378% | 8120724
WiredLSM | 00:03:11.62 / 00:07:15.99 / 00:01:28.38 | 2.7365097589 | 6337132 | 10219 | 135590 | 00:25:48.88 / 01:39:58.05 / 00:00:58.35 | 8184716 | 391% | 8948796
WiredBtree | 00:01:15.99 / 00:00:42.43 / 00:00:12.27 | 0.7198315568 | 6243828 | 9948 | 238957 | 00:15:05.25 / 00:39:29.28 / 00:08:36.82 | 12487884 | 318% | 9396320

There are other significant differences to point out in this test run. In the prior tests, each engine was basically run with default settings; the only non-default was the cache size. For this test we adopted the tuning options that RocksDB used in their report for the LevelDB-related engines. We also set LMDB to use its writable mmap option instead of the default read-only mmap.
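
For reference, the writable mmap option referred to here is LMDB's MDB_WRITEMAP environment flag. A minimal sketch of opening an environment with it follows; the exact flag combination the benchmark driver uses is visible in the published command scripts and driver source, so treat this as illustrative rather than a copy of that code.

#include <lmdb.h>

/* Open an LMDB environment using a writable memory map (MDB_WRITEMAP)
 * instead of the default read-only map, so updated pages are modified
 * through the map rather than written out with write() calls. */
static int open_writemap_env(const char *path, size_t mapsize, MDB_env **envp)
{
    int rc;

    rc = mdb_env_create(envp);
    if (rc)
        return rc;
    mdb_env_set_mapsize(*envp, mapsize);
    return mdb_env_open(*envp, path, MDB_WRITEMAP, 0664);
}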

For TokuDB, the test always crashed from running out of memory when configured with a 6GB cache (even though there was still a couple GB of RAM free on the machine) and so we had to pare it back to 4GB to get a complete run.

The Basho test was manually terminated at the 2 hour mark. It would have taken at least 6 or 7 more hours to complete on its own.

Update - 2014-06-16: Due to a transcription error when copying the RocksDB parameters from their site, we were still using the default memtable representation in the previous runs. We have re-run this particular test using "--key_size=16 --prefix_size=16 --keys_per_prefix=0 --memtablerep=prefix_hash" and added the result as "RocksDBpfx" to the table above and in the new charts below. Sorry for the mistake.

Loading the DB

The stats for loading the DB are shown in this graph.

The results show a couple of surprises. While all of the DBs are performing asynchronous writes, we see that both BerkeleyDB and WiredTiger Btree are only getting about 70-75% CPU use here, which indicates that they spent a significant portion of time waiting for I/Os to complete. LMDB still has an overhead of 1.0 as usual. The RocksDB tuning options appear to have helped speed up its load time considerably, getting it much closer to LMDB's speed.

This graph shows the load performance as throughput over time:

With the large Write Buffer Size set, all of the LevelDB-based engines turn in closer-to-linear throughput, but Basho still shows a steady decline, and all of them are still quite erratic.

This test highlights another major problem with so many of these engines - they all require complex tuning to get decent performance out of them. The tuning complexity of BerkeleyDB was one of the main issues that prompted us to write LMDB in the first place. Tuning RocksDB for this test requires explicit setting of 40-some parameters, as seen in the command scripts. This is an unreasonable demand on end-users, and indeed there's an open bug report for RocksDB on this very issue.

One of the many lessons we learned from 15+ years of working with BerkeleyDB is that adding code to address performance issues only makes things slower overall, harder to use, and harder to maintain. The key to good performance is writing less code, not more. Quality is always more important and more effective than quantity. This is why we put the Footprint overview right up front, in Section 1 - if you scan back through this report you can easily see how code size correlates to performance.

Run Time

The stats for running the actual readwhilewriting test are shown here.

As noted above, the Basho run was terminated at the 2 hour mark because it was taking too long to finish. All of the other engines were able to complete the test in under 2 hours. Even with the added tuning options, Basho just doesn't run well in this setting. The scaling graph doesn't give any clues either:

The only engines that show significant I/O overhead are BerkeleyDB and WiredTiger Btree. LevelDB's scaling improves dramatically with the added tuning options.

Performance

The actual throughput in operations per second is shown in this graph.

The left axis measures the Write throughput and the right axis measures the Read throughput. The writers were constrained to no more than 10240 writes per second, as before. BerkeleyDB and TokuDB are unable to achieve this write rate.

TokuDB's performance seems to really suffer from the 2GB reduction in its cache size. Unfortunately, there was no way to give it any more memory.

None of the other engines are anywhere close to LMDB's read rate. This result demonstrates that, as we've said before, LMDB delivers the read performance of a pure-memory database, while still operating as a persistent data store.

See the Details page for a detailed analysis of each engine's performance in this test.

Space Used

Finally, the space used by each engine is illustrated in this graph.

As mentioned, a 6GB cache was configured and the DB itself should only have been 5.8GB. The Final Size and Process Size reported for Basho cannot be relied on since that test was incomplete.

All of the DB engines besides LMDB bumped into the limits of the memory on the machine. In contrast, LMDB would easily handle twice as much data and still leave a few GB of RAM to spare, and continue to perform at top speed.

In a private conversation, a Tokutek engineer admonished me "you have to give TokuDB at least 50% of RAM, otherwise it's not fair to compare it to an mmap'd database that can use as much RAM as it wants."

Here's the reality - LMDB uses less RAM than every other DB engine to get its work done. If your engine needs 4x as much RAM to do its work, then your engine is inherently limited to doing 4x less useful work on any given machine. Why should anyone waste their time and money on a system like that?

With LMDB you get your work done with no added overhead. LMDB stores just the data you asked it to store, with no logging or other cruft, so you get the most use out of your available RAM and disk space. LMDB uses the minimum amount of CPU to store and retrieve your data, leaving the rest for your applications to actually get work done. (And leaving more power in your battery, on mobile devices.) No other DB engine comes anywhere close.

5. Further Testing

Check out our even larger test which really drives the point home - results for one billion records. A new test has also been added to show scaling with the number of reader threads.

Files

The files used to perform these tests are all available for download. Command script (20M), raw output (20M), command script (100M), raw output (100M), command script (50M), raw output (50M), command script (50M, RocksDBpfx), raw output (50M, RocksDBpfx), binaries. The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.

Software revisions used:

violino:/home/software/leveldb> g++ --version
g++ (Ubuntu/Linaro 4.7.3-1ubuntu1) 4.7.3
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

violino:/home/software/leveldb> git log -1 --pretty=format:"%H %ci" master
e353fbc7ea81f12a5694991b708f8f45343594b1 2014-05-01 13:44:03 -0700

violino:/home/software/basho_leveldb> git log -1 --pretty=format:"%H %ci" develop
16b22c8198975b62a938dff9910f4432772d253a 2014-06-06 12:25:40 -0400

violino:/home/software/db-5.3.21> ls -l README 
-rw-r--r-- 1 hyc hyc 234 May 11  2012 README

violino:/home/software/HyperLevelDB> git log -1 --pretty=format:"%H %ci" releases/1.0
a7a707e303ec1953d08cbc586312ac7b2988eebb 2014-02-10 09:43:03 -0500

violino:~/OD/mdb> git log -1 --pretty=format:"%H %ci" 
a93810cc3d1a062bf5edbe9c14795d0360cda8a4 2014-05-30 23:39:44 -0700

violino:/home/software/rocksdb> git log -1 --pretty=format:"%H %ci"
0365eaf12e9e896ea5902fb3bf3db5e6da275d2e 2014-06-06 18:27:44 -0700

violino:/home/software/ft-index> git log -1 --pretty=format:"%H %ci" master
f51c7180db1eafdd9e6efb915c396d894c2d0ab1 2014-05-30 12:58:28 -0400

violino:/home/software/wiredtiger> git log -1 --pretty=format:"%H %ci"
91da74e5946c409b8e05c53927a7b447129a6933 2014-05-21 17:05:08 +1000

All of the engines were built with compression disabled; compression was not used in the RocksDB test either. Some of these engines recommend/require use of a non-standard malloc library like Google tcmalloc or jemalloc. To ensure as uniform a test as possible, all of the engines in this test were built to use the standard libc malloc.

Tests comparing tcmalloc and jemalloc are available in the malloc microbench report. Tests comparing different compression mechanisms are available in the compressor microbench report.