Malloc Microbenchmark

Symas Corp., February 2015


Using the same DB engines as in our previous On-Disk and In-Memory microbenchmarks, we investigate the effects of changing from the standard glibc malloc implementation to jemalloc or Google's tcmalloc. The libraries are as provided in Debian Jessie, using glibc 2.19, libtcmalloc_minimal 2.2.1-0.2, and libjemalloc 3.6.0-3.

Some of the DB engines recommend a particular malloc implementation, but it's not always obvious why a given choice was made. The tests conducted here seek to reveal the impact of each choice. The tests are performed on our HP DL585 G5 server with 128GB RAM and four quad-core AMD Opteron 8354 CPUs (16 cores total). Since we're just interested in how malloc affects the DB performance, this is purely an in-memory test.

1. Test Overview

For this test we use a database with 2000000 records and 4000 bytes per record, so roughly 8GB in size. The DB is stored on a tmpfs, which can grow up to 64GB on this server. We had originally tried to use a larger DB but all of the LSM engines crashed with disk full errors, even though the data volume was still much smaller than 64GB. In the current results, Basho LevelDB still filled up all 64GB of the available space even with just the 8GB data set.
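
For reference, a tmpfs of this size takes only a single mount command to set up. The mount point and options below are just an illustration of the configuration described above, not the exact commands from our scripts:

# Hypothetical tmpfs setup matching the description above; the actual
# mount point used by the test scripts may differ.
sudo mount -t tmpfs -o size=64g tmpfs /mnt/ramdisk
df -h /mnt/ramdisk    # verify the filesystem size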

After loading the DB using batched sequential writes, a readwhilewriting test is run with 1 writer thread and 16 reader threads. None of the threads are constrained to any particular throughput level, and the test runs until all 16 reader threads have each randomly retrieved all 2000000 records.

There was a fair amount of variation between runs, so each DB engine is run 3 times with each malloc library. jemalloc and tcmalloc were invoked using LD_PRELOAD, so the exact same benchmark binaries are used in each test and only the malloc implementation is changed. The results for a given library are averaged across the 3 runs and presented in the tables below. The raw data, spreadsheet, and command script are available for download.
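
As a sketch of the mechanics (not the literal contents of the cmd3 script), the benchmark binaries can be pointed at each allocator as shown below. The db_bench-style flags, paths, and thread accounting are illustrative and vary per driver and per system:

# Hypothetical invocation, following LevelDB db_bench conventions; the real
# drivers and options are in the cmd3 script linked in the Files section.
DB=/mnt/ramdisk/testdb       # illustrative path on the tmpfs

# 1. Load phase: batched sequential fill, 2000000 records of 4000 bytes
./db_bench --db=$DB --num=2000000 --value_size=4000 --benchmarks=fillbatch

# 2. Run phase: readwhilewriting
#    (--threads gives the reader count here; how the single writer thread is
#    accounted for is driver-dependent)
./db_bench --db=$DB --use_existing_db=1 --num=2000000 --value_size=4000 \
    --benchmarks=readwhilewriting --threads=16

# 3. Repeat with a different allocator by preloading it; the binary is
#    unchanged. Adjust the paths to wherever the Debian packages install
#    the libraries, and pass the same arguments as above in place of "...".
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ./db_bench ...
LD_PRELOAD=/usr/lib/libtcmalloc_minimal.so.4 ./db_bench ...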

Load phase:

Engine      malloc    Fill ops/sec  Usr sec  Sys sec  CPU %   Wall      RSS KB
Basho       glibc     37651.33      123.56   59.38    336.33  00:54.33  127196.00
Basho       jemalloc  35889.33      123.61   67.61    335.33  00:56.93  122224.00
Basho       tcmalloc  36421.33      130.58   58.37    336.00  00:56.17  134229.33
BDB         glibc     20526.33      69.83    30.28    99.00   01:40.21  9309362.67
BDB         jemalloc  20236.33      71.35    30.18    99.00   01:41.62  9310096.00
BDB         tcmalloc  19277.67      76.08    30.41    99.00   01:46.59  9311358.67
Hyper       glibc     44562.33      46.93    26.85    161.00  00:45.71  678653.33
Hyper       jemalloc  44419.00      47.07    26.84    161.00  00:45.80  673344.00
Hyper       tcmalloc  43468.00      48.85    26.87    160.67  00:46.95  685753.33
LevelDB     glibc     54178.00      46.90    17.73    173.33  00:37.22  516938.67
LevelDB     jemalloc  46198.67      49.36    23.88    167.67  00:43.54  518184.00
LevelDB     tcmalloc  51751.00      50.88    16.81    173.00  00:39.03  532433.33
LMDB        glibc     164342.33     5.64     7.08     99.00   00:12.75  8070749.33
LMDB        jemalloc  162537.00     5.68     7.17     99.00   00:12.88  8071140.00
LMDB        tcmalloc  160763.00     5.88     7.14     99.00   00:13.05  8076445.33
RocksDB     glibc     57961.67      45.14    18.18    181.67  00:34.74  607101.33
RocksDB     jemalloc  52566.67      44.84    23.70    179.00  00:38.16  562550.67
RocksDB     tcmalloc  58078.67      46.81    17.57    185.00  00:34.71  626377.33
RocksDB2    glibc     61744.00      47.43    19.94    204.00  00:32.90  544488.00
RocksDB2    jemalloc  55608.33      42.74    23.86    183.67  00:36.16  402370.67
RocksDB2    tcmalloc  62242.33      50.44    19.34    213.67  00:32.62  567416.00
TokuDB      glibc     29050.33      86.38    112.60   122.67  02:41.54  9254672.00
TokuDB      jemalloc  29038.00      98.09    89.84    131.00  02:22.96  8350152.00
TokuDB      tcmalloc  36285.67      124.51   84.64    239.33  01:27.27  10562993.33
WiredBtree  glibc     135670.00     7.13     7.62     99.00   00:14.77  6277.33
WiredBtree  jemalloc  135946.67     7.33     7.39     99.00   00:14.74  6461.33
WiredBtree  tcmalloc  123948.00     8.33     7.87     99.00   00:16.23  8880.00
WiredLSM    glibc     34639.00      96.27    38.24    228.67  00:58.70  1989202.67
WiredLSM    jemalloc  25730.33      108.29   61.66    215.67  01:18.60  1557186.67
WiredLSM    tcmalloc  33913.33      103.04   39.16    237.67  00:59.80  4079472.00

readwhilewriting phase:

Engine      malloc    Write ops/sec  Read ops/sec  Usr sec   Sys sec  CPU %    Wall      RSS KB
Basho       glibc     n/a            n/a           7085.82   1242.23  1315.67  10:29.33  20620677.33
Basho       jemalloc  n/a            n/a           6212.08   1943.22  1496.00  09:04.87  19632088.00
Basho       tcmalloc  n/a            n/a           n/a       n/a      n/a      n/a       n/a
BDB         glibc     2676.33        54408.33      1841.69   7259.20  1543.67  09:49.27  9033684.00
BDB         jemalloc  2707.33        55896.00      1817.45   7030.72  1542.00  09:33.51  9035160.00
BDB         tcmalloc  2643.33        57031.67      1838.08   6835.63  1542.33  09:22.20  9036669.33
Hyper       glibc     18881.00       148509.33     19794.44  559.25   1369.67  03:40.34  17911013.33
Hyper       jemalloc  19595.33       148247.33     2432.61   580.12   1367.33  03:40.25  17656761.33
Hyper       tcmalloc  18919.33       141846.00     2628.05   531.19   1370.00  03:50.51  17928325.33
LevelDB     glibc     7952.33        85805.33      3529.82   1326.26  1287.67  06:17.09  11900356.00
LevelDB     jemalloc  8449.67        98171.33      3088.34   1260.91  1314.33  05:30.83  11326717.33
LevelDB     tcmalloc  8457.00        95788.33      3271.56   1135.35  1301.00  05:38.51  11995682.67
LMDB        glibc     33403.33       3028569.67    145.61    2.22     1300.33  00:11.36  8093228.00
LMDB        jemalloc  33671.67       3132727.33    147.00    2.20     1356.67  00:11.01  8092724.00
LMDB        tcmalloc  32289.00       3064187.67    149.17    2.17     1344.67  00:11.25  8100145.33
RocksDB     glibc     13910.67       34164.33      10855.35  3465.73  1521.67  15:40.77  2258294.67
RocksDB     jemalloc  14762.67       41509.33      8797.22   3117.37  1535.67  12:55.49  1248504.00
RocksDB     tcmalloc  11151.33       40718.67      9193.03   3028.43  1547.00  13:09.76  1296916.00
RocksDB2    glibc     15872.67       118042.33     3122.98   1030.11  1516.33  04:33.86  1467462.67
RocksDB2    jemalloc  15917.33       123310.00     2872.10   1121.72  1518.33  04:22.91  1086408.00
RocksDB2    tcmalloc  15692.67       126792.67     2928.49   968.21   1526.00  04:15.27  1102588.00
TokuDB      glibc     2028.33        29545.00      6367.58   3116.99  857.33   18:25.21  15980325.33
TokuDB      jemalloc  3812.67        42467.00      6578.44   3138.79  1273.00  12:43.10  9109910.67
TokuDB      tcmalloc  2827.67        45894.00      5112.53   1837.14  973.00   11:56.44  14282605.33
WiredBtree  glibc     12166.67       515619.33     322.12    462.40   1209.67  01:04.83  793792.00
WiredBtree  jemalloc  13368.33       546248.33     313.83    437.27   1208.67  01:02.11  585908.00
WiredBtree  tcmalloc  12948.00       532682.00     307.50    448.76   1254.67  01:00.24  563261.33
WiredLSM    glibc     25964.67       262389.33     1752.64   83.63    1475.67  02:04.40  20176804.00
WiredLSM    jemalloc  13271.67       345567.67     1380.92   57.90    1516.33  01:34.83  13175570.67
WiredLSM    tcmalloc  27161.33       260150.00     1807.30   82.39    1504.67  02:05.55  21171470.67

(n/a: value not reported for these runs; the raw spreadsheet records a #DIV/0! error.)

Discussion

For most of the DB engines there's very little change in performance or total memory usage between these three malloc libraries. LMDB's results serve as a control here, since it performs no mallocs at all in this benchmark; any variation in its numbers is just run-to-run noise in the underlying operating system. That noise is as large as roughly 3.4% (e.g. LMDB's random read throughput across its three runs with each malloc), so we use 3.5% as our margin of error in the rest of the discussion.

A number of engines appeared slower during their initial DB load when using a non-standard malloc. For example, BerkeleyDB's load throughput was 7% slower using tcmalloc. LevelDB's load throughput was 15% slower using jemalloc. RocksDB was 10% slower using jemalloc, and RocksDB with optimized settings was also 10% slower using jemalloc. WiredTiger's LSM was 25% slower using jemalloc. We don't have a solid explanation for this; one guess is that the DB load is primarily a single-threaded job, and the optimizations these other malloc libraries make to support multi-threaded workloads turn into a pessimization here. Of course, most of these engines also do a lot of work in background threads during the DB load, so it's not strictly a single-threaded workload. tcmalloc also generally uses more RAM in this phase.
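
A quick way to probe that guess is to time the load phase alone under each allocator and compare wall clock and peak RSS directly. The loop below is only a sketch; the paths and flags are illustrative, as above, not the commands from our scripts:

# Illustrative sketch: run only the fill phase under each allocator and let
# GNU time report elapsed time and maximum RSS. An empty LD_PRELOAD means
# the default glibc malloc is used.
for lib in "" libjemalloc.so.1 libtcmalloc_minimal.so.4; do
    rm -rf /mnt/ramdisk/testdb
    echo "=== allocator: ${lib:-glibc} ==="
    LD_PRELOAD=$lib /usr/bin/time -v \
        ./db_bench --db=/mnt/ramdisk/testdb --num=2000000 --value_size=4000 \
                   --benchmarks=fillbatch 2>&1 |
        grep -E 'fillbatch|Elapsed|Maximum resident'
done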

In the readwhilewriting phase, the trend generally reverses and the non-standard mallocs tend to deliver faster throughput (with some exceptions of course). E.g. LevelDB gets 13% faster throughput using jemalloc than standard glibc malloc. RocksDB gets 18% faster with jemalloc. TokuDB gets a whole 31% faster with jemalloc and 35% faster with tcmalloc. TokuDB also uses 43% less memory with jemalloc. This result makes TokuDB's choice of jemalloc pretty understandable.

Unfortunately the results don't present a clear-cut best choice. For some engines and workloads tcmalloc is fastest; for others plain glibc is best. tcmalloc sometimes has excessive RAM usage, while jemalloc is generally more compact, but not always.

Conclusion

The magnitude of the gains with non-standard mallocs is pretty remarkable, but in the grand scheme of things it doesn't amount to much; all of the affected DB engines are still an order of magnitude slower than LMDB. To quote Tim Callaghan: it's not just the library, implementation counts.

When your DB engine uses 20GB of RAM to manage an 8GB database, you've done something seriously wrong. When your DB engine uses malloc so extensively that swapping out malloc libraries makes over a 30% difference in performance, you've done something seriously wrong. Yes, implementation counts. The most efficient memory allocation is the one you didn't have to make.

Files

The files used to perform these tests are all available for download.

The command script: cmd3; the raw output: out.mallocs.tgz; and the LibreOffice spreadsheet: DBmallocs.ods.

The source code for the benchmark drivers is all on GitHub. We invite you to run these tests yourself and report your results back to us.

The software versions we used:

violino:/home/software/leveldb> g++ --version
g++ (Ubuntu 4.8.2-19ubuntu1) 4.8.2
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

violino:/home/software/leveldb> git log -1 --pretty=format:"%H %ci" master
e353fbc7ea81f12a5694991b708f8f45343594b1 2014-05-01 13:44:03 -0700

violino:/home/software/basho_leveldb> git log -1 --pretty=format:"%H %ci" develop
d1a95db0418d4e17223504849b9823bba160dfaa 2014-08-21 15:41:50 -0400

violino:/home/software/db-5.3.21> ls -l README 
-rw-r--r-- 1 hyc hyc 234 May 11  2012 README

violino:/home/software/HyperLevelDB> git log -1 --pretty=format:"%H %ci" master
02ad33ccecc762fc611cc47b26a51bf8e023b92e 2014-08-20 16:44:03 -0400

violino:~/OD/mdb> git log -1 --pretty=format:"%H %ci"
a054a194e8a0aadfac138fa441c8f67f5d7caa35 2014-08-24 21:18:03 +0100

violino:/home/software/rocksdb> git log -1 --pretty=format:"%H %ci"
7e9f28cb232248b58f22545733169137a907a97f 2014-08-29 21:21:49 -0700

violino:/home/software/ft-index> git log -1 --pretty=format:"%H %ci" master
f17aaee73d14948962cc5dea7713d95800399e65 2014-08-30 06:35:59 -0400

violino:/home/software/wiredtiger> git log -1 --pretty=format:"%H %ci"
1831ce607baf61939ddede382ee27e193fa1bbef 2014-08-14 12:31:38 +1000

All of the engines were built with compression disabled. We will compare compression engines in an upcoming test.