HyperDex Benchmark

Symas Corp., August 2013


This page shows the performance of LMDB vs HyperLevelDB as used in HyperDex. HyperLevelDB was forked from Google's LevelDB by the HyperDex developers in an attempt to fix some of LevelDB's erratic performance issues. The LMDB version of HyperDex is git rev 8af7e0180d8dce857b6d0ac7977ee51bc14fd456, while the HyperLevelDB version is rev 968c7cc333c7000c14637958ccfc5f28f6dbc9ae. Both are available in this Github repo.

We used HyperLevelDB rev 0e4446225cd99942ce452973663b41d100d0730b from Github and LMDB rev 2cc2574d84686d2e2556e86f78a962bd593af19c from Gitorious.

Also we're using the LMDB version of Replicant for the HyperDex coordinator in all the tests. This version is available on Github. This is only mentioned for completeness, as the HyperDex developers have assured us that there is no performance impact from the Replicant.

Setup

HyperDex is a NoSQL data store that offers some impressive distributed data features. But in this test we're only looking at the performance of LMDB against HyperLevelDB, so only a single data node is used. We're using the Yahoo! Cloud Serving Benchmark (YCSB) to drive the HyperDex server.

Ordinarily we benchmark on either our 16-core 128GB RAM server or our 64-core 512GB RAM server, but NoSQL folks tend to be more interested in running on dinky little boxes with much less memory. For this test the data node is my old Dell Precision M4400 laptop with a quad-core Intel Q9300 CPU @ 2.53GHz and 8GB of DDR2 DRAM. Two sets of tests are performed. The first uses 10 million records and a Crucial M4 512GB SSD with a reiserfs partition. (This is the system disk of the laptop, so it is about 50% full and has been in steady use for several months.) At 4000 bytes per record the resulting database is around 40GB. The second test uses 100 million records; there wasn't enough free space on the SSD for that, so a Seagate ST9500420AS HDD with an XFS partition was used instead.

The YCSB load generator and the HyperDex coordinator/Replicant are running on an Asus N56DP laptop with quad-core AMD A10-4600M APU and 16GB DDR3 DRAM. The load generator never gets anywhere near stressing out this machine. The laptops are plugged into Gbit ethernet through a TP-Link WR-1043ND wifi router, and there is no other traffic on the network. The data node is booted in single-user mode so no other services are running on the machine.

The basic YCSB setup is copied from this mapkeeper benchmark. The main difference from their workload is that we configure only 4 threads instead of their 100. We also use more records and more operations in each test. Our workload file is provided for reference.
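For reference, the parameters described here correspond to a YCSB CoreWorkload configuration along the following lines. This is only a sketch reconstructed from the numbers quoted on this page, not a copy of our actual workload file; in particular, the 10-field-by-400-byte layout that yields 4000-byte records is an assumption.

  # Sketch of the 10M-record workload; the 100M-record tests raise recordcount accordingly
  workload=com.yahoo.ycsb.workloads.CoreWorkload
  threadcount=4
  recordcount=10000000
  operationcount=1000000
  # 10 fields of 400 bytes each, for 4000-byte records (assumed layout)
  fieldcount=10
  fieldlength=400
  # Run phase mix: 80% reads, 20% updates
  readproportion=0.8
  updateproportion=0.2
  insertproportion=0
  scanproportion=0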

The HyperDex setup is copied from this HyperDex benchmark. We only create 4 partitions, instead of the 24 they used. There is no fault tolerance since we have only one data node, but that's not relevant to this test.
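For illustration only, a YCSB-style space with 4 partitions and no fault tolerance would be declared to the coordinator with a space definition roughly like the one below. The space and attribute names here are our assumptions; the actual definition follows the referenced HyperDex benchmark.

  space usertable
  key k
  attributes field0, field1, field2, field3, field4, field5, field6, field7, field8, field9
  create 4 partitions
  tolerate 0 failures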

Results

Using YCSB we first load 10 million records, then perform 1 million random operations on the resulting database. In this run phase, 20% of the operations are updates and 80% are reads. For the LMDB tests the HyperDex daemon is started with "-s 51200", which tells it to use a mapsize of 50GB (the value is in MB). No special options are used for HyperDex with HyperLevelDB.
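As a quick check on that setting: 50GB comfortably covers the roughly 40GB of record data loaded by YCSB, and the resulting byte count matches the "Map size" that mdb_stat reports further down. A minimal worked example, using Python just for the arithmetic:

  # -s 51200 means 51200 MB of LMDB map size
  map_size = 51200 * 1024 * 1024     # 53,687,091,200 bytes, the "Map size" mdb_stat shows below
  record_data = 10_000_000 * 4000    # ~40 GB of user records from the load phase
  print(map_size, record_data)       # plenty of headroom left for B-tree and freelist pages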

10M Records, Sequential Insert


HyperLevelDB still suffers from long intervals of near-zero throughput due to the extreme write amplification of the LSM design. It's also a CPU hog, regularly consuming 390% of the CPU during the load, while LMDB uses a steady 125% throughout. Both are performing asynchronous background flushes for this test.

LMDB gives consistent, fast response while HyperLevelDB's response is wildly unpredictable. Even at its peak, HyperLevelDB is only about 80% as fast as LMDB, and overall LMDB completes the load in less than half the time. HyperLevelDB also uses almost five times as much CPU as LMDB. To summarize:
10M records, Sequential Insert
                MinLatency(us)  AvgLatency(us)  95th%ile(ms)  99th%ile(ms)  MaxLatency(ms)  Runtime(sec)  Throughput(ops/sec)  CPUtime(mm:ss)
LMDB                       183             419             0             0            2565          1058                 9449        19:44.52
LevelDB                    189             907             0             2           33925          2279                 4388        96:46.96


10M Records, 1M Ops


Both deliver fairly smooth performance for this test, but again LMDB is more than twice as fast as HyperLevelDB. This test uses very little CPU since it is mostly I/O bound; still, HyperLevelDB uses more than twice as much CPU as LMDB.
10M records, 1M ops
                MinLatency(us)  AvgLatency(us)  95th%ile(ms)  99th%ile(ms)  MaxLatency(ms)  Runtime(sec)  Throughput(ops/sec)  CPUtime(mm:ss)
LMDB update                225             543             0             1              18           143                 6973         1:33.17
LMDB read                  163             570             0             1              20
LevelDB update             227            1163             2             3             147           307                 3256         3:37.67
LevelDB read               191            1228             2             3             206
(Runtime, throughput, and CPU time are reported once for the whole 1M-operation run.)

Here's the mdb_stat output for the resulting LMDB database:

Environment Info
  Map address: (nil)
  Map size: 53687091200
  Page size: 4096
  Max pages: 13107200
  Number of pages used: 10363447
  Last transaction ID: 10220428
  Max readers: 126
  Number of readers used: 1
Freelist Status
  Tree depth: 1
  Branch pages: 0
  Leaf pages: 1
  Overflow pages: 0
  Entries: 7
  Free pages: 45
Status of Main DB
  Tree depth: 4
  Branch pages: 5733
  Leaf pages: 357666
  Overflow pages: 10000000
  Entries: 20210850
There are 10,000,000 overflow pages, which makes sense since the records are 4000 bytes each: one record occupies one overflow page. At 4096 bytes per page, the 5733 branch pages consume about 23.5MB, while the 357666 leaf pages consume about 1.46GB. Thus, even though the total database is 5 times larger than RAM, all of the key lookups can be memory resident, and at most one disk I/O is needed to retrieve a record's data. There are over 20 million entries in the DB because HyperDex stores additional information for each user record; this additional data is inconsequential here.
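As a sanity check, those sizes follow directly from the page counts in the mdb_stat output above; a small worked example in Python:

  PAGE_SIZE = 4096                              # "Page size: 4096" above
  branch_pages, leaf_pages, overflow_pages = 5733, 357666, 10_000_000

  print(branch_pages * PAGE_SIZE / 1e6)         # ~23.5 MB of branch (interior) pages
  print(leaf_pages * PAGE_SIZE / 1e9)           # ~1.46 GB of leaf pages; both fit easily in 8GB RAM
  print(overflow_pages * PAGE_SIZE / 1e9)       # ~41 GB of overflow pages, one 4000-byte record each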

The raw test output is available in this tar archive. It is also tabulated in this OpenOffice spreadsheet.

100M Records, Sequential Insert

We repeated the tests using 100 million records, as the HyperDex developers stated that the differences between HyperLevelDB and plain LevelDB don't really manifest in smaller workloads. The LMDB HyperDex daemon is started with "-s 512000" to set a 500GB map size.

At this scale HyperLevelDB is over five times slower than LMDB. Also, I'm not sure this data fairly represents the time HyperLevelDB required: after the YCSB client finished, the data node spent another 2 hours compacting the database, with disk bandwidth maxed out and CPU use over 150%, and we had to wait for that compaction to finish before we could begin the run phase of the test.

We have heard criticisms of earlier LMDB tests because their runtimes were considered "too short." The fact is, other DBs need to run for tens of hours to complete the same amount of useful work LMDB accomplishes in only a few hours.

The difference in efficiency is enormous. Using LevelDB for a workload means using 3-4x as much CPU and far more disk bandwidth, resulting in higher electricity consumption, greater wear on the storage devices, and earlier hardware replacement. Even though HyperLevelDB improves on LevelDB's write concurrency and compaction efficiency, it is still far less efficient than LMDB, using almost 15 times as much CPU in this task.
100M records, Sequential Insert
                MinLatency(us)  AvgLatency(us)  95th%ile(ms)  99th%ile(ms)  MaxLatency(ms)  Runtime(sec)  Throughput(ops/sec)  CPUtime(mm:ss)
LMDB                       173             544             0             0            1439         13702                 7298          227:26
LevelDB                    199            2737             4            11           17573         68697                 1456         3373:13


100M Records, 1M Ops


LMDB is still about twice as fast as HyperLevelDB at this scale, and while the throughput is no longer a smooth curve, LMDB is still much more consistent than HyperLevelDB. Again, while the task is mostly I/O bound, LMDB uses much less CPU time than HyperLevelDB.
100M records, 1M ops
                MinLatency(us)  AvgLatency(us)  95th%ile(ms)  99th%ile(ms)  MaxLatency(ms)  Runtime(sec)  Throughput(ops/sec)  CPUtime(mm:ss)
LMDB update                256           33565           130           209             637          8385                  119         4:21.41
LMDB read                  215           33493           130           207             817
LevelDB update             241           63660           231           385           20370         15863                   63           17:27
LevelDB read               188           63250           230           383            7904
(Runtime, throughput, and CPU time are reported once for the whole 1M-operation run.)

Here's the mdb_stat output for the resulting LMDB database:

Environment Info
  Map address: (nil)
  Map size: 536870912000
  Page size: 4096
  Max pages: 131072000
  Number of pages used: 103667712
  Last transaction ID: 100481421
  Max readers: 126
  Number of readers used: 0
Freelist Status
  Tree depth: 1
  Branch pages: 0
  Leaf pages: 1
  Overflow pages: 0
  Entries: 18
  Free pages: 145
Status of Main DB
  Tree depth: 5
  Branch pages: 58461
  Leaf pages: 3609103
  Overflow pages: 100000000
  Entries: 200269469
There are 100,000,000 overflow pages, which is as expected. At 4096 bytes per page, the 58461 branch pages consume about 239MB, while the 3609103 leaf pages consume about 14.8GB. Now that the total database is 50 times larger than RAM, only part of the leaf set can stay cached, so a key lookup will frequently require a disk I/O, and another disk I/O is always needed to fetch the record data from its overflow page. According to this hard drive review, a 4KB I/O on this drive takes 16ms on average. At two I/Os per DB request that is about 32ms, quite close to the 33ms average latency that LMDB delivers here.
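A rough back-of-the-envelope version of that latency estimate, with our own assumption about what stays cached:

  PAGE_SIZE = 4096
  branch_gb = 58461 * PAGE_SIZE / 1e9     # ~0.24 GB of branch pages: these stay cached
  leaf_gb = 3609103 * PAGE_SIZE / 1e9     # ~14.8 GB of leaf pages: only partly cacheable in 8GB RAM
  hdd_io_ms = 16                          # average 4KB random read on this HDD, per the review

  # Best case the leaf page is already cached and only the record's overflow page must be read;
  # worst case both the leaf page and the overflow page require a seek.
  best_ms, worst_ms = 1 * hdd_io_ms, 2 * hdd_io_ms
  print(best_ms, worst_ms)                # 16-32 ms, in line with the ~33 ms YCSB measured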

It's not so simple to analyze HyperLevelDB's I/O load since the DB design is so byzantine, manipulating multiple files and making massive copies of data from one file to another as it merges one level of tables into the next. But the numbers speak for themselves: LMDB's simple design is still the most efficient, getting the required work done with a minimum of CPU time, disk bandwidth, and wall-clock time.

The raw test output is available in this tar archive. It is also tabulated in this OpenOffice spreadsheet.

Conclusion

HyperDex is a very promising NoSQL data store, and we have big plans for it in the OpenLDAP Project too. But LevelDB-based storage engines are clearly a liability; LMDB outclasses them by every measure.