July 22, 2014

Impact of memory allocators on MySQL performance

MySQL server uses dynamic memory allocation intensively, so a good choice of memory allocator is quite important for proper utilization of CPU/RAM resources. An efficient memory allocator should improve scalability, increase throughput, and keep the memory footprint under control. In this post I’m going to check the impact of several memory allocators on the performance/scalability of MySQL server in read-only workloads.

For my testing I chose the following allocators: lockless, jemalloc-2.2.5, jemalloc-3.0, tcmalloc (gperftools-2.0), glibc-2.12.1 (new malloc) (CentOS 6.2), glibc-2.13 (old malloc), glibc-2.13 (new malloc), glibc-2.15 (new malloc).

Let me clarify a bit about malloc in glibc. Starting from glibc-2.10 it has carried two malloc implementations, selectable with the configure option --enable-experimental-malloc. (You can find details about the new malloc here.) Many distros switched to this new malloc in 2009. In my experience this new malloc did not always behave efficiently with MySQL, so I decided to include the old one in the comparison as well. I used glibc-2.13 for that purpose because the --enable-experimental-malloc option was later removed from the glibc sources.

I built all allocators from source (except the system glibc-2.12.1) with the stock CentOS gcc (version 4.4.6 20110731). All of them were built with -O3. I used LD_PRELOAD for lockless, jemalloc-2.2.5, jemalloc-3.0, and tcmalloc; for the custom glibc builds I prefixed mysqld with the corresponding glibc’s dynamic loader.
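To illustrate the mechanics of both launch methods (the install paths below are hypothetical, not the exact ones used in these tests):

```shell
# Hypothetical install prefix for a custom glibc build; adjust to your layout.
GLIBC=/opt/glibc-2.13

# Option 1: preload an alternative allocator (lockless/jemalloc/tcmalloc)
LD_PRELOAD=/usr/local/lib/libjemalloc.so mysqld_safe --defaults-file=/etc/my.cnf &

# Option 2: run mysqld under a different glibc build by invoking that build's
# dynamic loader directly and pointing it at the build's libraries first
$GLIBC/lib/ld-2.13.so --library-path "$GLIBC/lib:/lib64:/usr/lib64" \
    /usr/sbin/mysqld --defaults-file=/etc/my.cnf &
```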

  • Testing details: 
    • Cisco UCS C250 box
    • Percona Server 5.5.24
    • 2 read-only scenarios: OLTP_RO and POINT_SELECT from the latest sysbench-0.5
    • dataset consists of 4 sysbench tables (50M rows each), ~50G of data / CPU-bound case
    • innodb_buffer_pool_size=52G
  • For every malloc allocator perform the following steps:
    • start Percona Server either with LD_PRELOAD=[allocator_lib.so] or with the glibc loader prefix (see above); get the RSS/VSZ size of mysqld
    • warmup with ‘select avg(id) from sbtest$i FORCE KEY (PRIMARY)’ and then OLTP_RO for 600 sec
    • run the OLTP_RO/POINT_SELECT test cases, duration 300 sec each, varying the number of threads: 8/64/128/256/512/1024/1536
    • stop the server; get the RSS/VSZ size of mysqld
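The per-allocator cycle above can be sketched as a shell script; the paths, connection details, and sysbench options are assumptions for illustration, not the exact ones used in these runs:

```shell
#!/bin/sh
# Sketch of one per-allocator test cycle (paths and options are illustrative).
ALLOC_LIB=/usr/local/lib/libjemalloc.so

LD_PRELOAD=$ALLOC_LIB mysqld_safe --defaults-file=/etc/my.cnf &
sleep 60
ps -C mysqld -o rss=,vsz=                     # footprint after startup

# warmup: full PK scan of every sysbench table, then a 600 sec OLTP_RO run
for i in 1 2 3 4; do
  mysql sbtest -e "SELECT AVG(id) FROM sbtest$i FORCE KEY (PRIMARY)"
done
sysbench --test=tests/db/oltp.lua --oltp-read-only=on \
         --max-time=600 --num-threads=64 run

# measured runs: 300 sec per thread count
for t in 8 64 128 256 512 1024 1536; do
  sysbench --test=tests/db/oltp.lua --oltp-read-only=on \
           --max-time=300 --num-threads=$t run
done

ps -C mysqld -o rss=,vsz=                     # footprint before shutdown
mysqladmin shutdown
```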

The best throughput/scalability comes with lockless/jemalloc-3.0/tcmalloc. jemalloc-2.2.5 drops slightly at higher numbers of threads, and on the response-time graph (see below) there are spikes for it that may be caused by some contention inside the library. All glibc variants based on the new malloc demonstrate notable drops with increasing concurrency – almost two times at high thread counts. At the same time, glibc-2.13 built with the old malloc looks good; its results are very similar to lockless/jemalloc-3.0/tcmalloc.

For the POINT_SELECT test with increasing concurrency we have two allocators that handle the load very well – tcmalloc and, only slightly behind, glibc-2.13 with the old malloc. Then come jemalloc-3.0/lockless/jemalloc-2.2.5, and last are the glibc allocators based on the new malloc. Along with the best throughput/scalability, runs with tcmalloc also demonstrate the best response time (30-50 ms at high thread counts).

Besides throughput and latency there is one more factor that should be taken into account – memory footprint.

memory allocator          mysqld RSS size growth (kB)   mysqld VSZ size growth (kB)
lockless                  6,966,736                     105,780,880
jemalloc-2.2.5            214,408                       3,706,880
jemalloc-3.0              216,084                       5,804,032
tcmalloc                  456,028                       514,544
glibc-2.13-new-malloc     210,120                       232,624
glibc-2.13-old-malloc     253,568                       1,006,204
glibc-2.12.1-system       162,952                       215,064
glibc-2.15-new-malloc     5,106,124                     261,636


Only two allocators – lockless and glibc-2.15 with the new malloc – notably increased the RSS memory footprint of the mysqld server, by more than 5 GB. Memory usage for the other allocators looks more or less acceptable.

Taking into account all 3 factors – throughput, latency, and memory usage – for the above POINT_SELECT/OLTP_RO types of workload, the most suitable allocators are tcmalloc, jemalloc-3.0, and glibc-2.13 with the old malloc.

An important takeaway is that a newer glibc with the new malloc implementation may NOT be suitable for such workloads and may show worse results than older platforms.
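If you are stuck on a glibc with the new malloc, one knob worth trying (an assumption on my side – it was not part of these benchmarks) is capping the number of per-thread arenas it creates:

```shell
# MALLOC_ARENA_MAX is honored by glibc's new (per-thread arena) malloc;
# a low cap trades some allocation concurrency for a smaller footprint.
MALLOC_ARENA_MAX=4 mysqld_safe --defaults-file=/etc/my.cnf &
```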

UPDATE:

To cover some questions raised in the comments, I reran the OLTP_RO/POINT_SELECT tests with jemalloc-2.2.5/jemalloc-3.0/tcmalloc, varied /sys/kernel/mm/transparent_hugepage/enabled (always|never), and gathered the mysqld size with ‘ps --sort=-rss -eo pid,rss,vsz,pcpu’ during the test run. As a reminder, the whole test cycle looks like the following: start server, warmup, OLTP_RO test, POINT_SELECT test. So on the charts below you will see how the mysqld footprint changes during the test cycle and what the impact of disabling huge pages is.
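For reference, toggling THP and sampling the footprint can be done roughly like this (the sampling interval and count are assumptions; the THP write requires root):

```shell
# Switch transparent huge pages on ("always") or off ("never") system-wide
echo never > /sys/kernel/mm/transparent_hugepage/enabled

# Sample mysqld RSS/VSZ (in kB) every 10 seconds during a run
for n in $(seq 1 30); do
  ps --sort=-rss -eo pid,rss,vsz,pcpu | grep '[m]ysqld'
  sleep 10
done
```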

You can read Part 2 of this topic here.

Comments

  1. Bradley C Kuszmaul says:

    Is this on CentOS 6 with transparent huge pages enabled? I’ve found that transparent huge pages make jemalloc’s memory footprint much larger; without transparent huge pages, jemalloc tends to have a smaller footprint than glibc.

  2. Hander says:

    And on Windows? Any benchmark?

  3. gebi says:

    Your last memory accounting statistics don’t take into account the local memory caching the allocators use internally.
    E.g. with tcmalloc it greatly depends on when you take your statistics, as it has thread-local memory pools and may not instantly free your memory.

  4. At the risk of asking for too much work from you, I agree with gebi that it would be good to see RSS during the test. jemalloc also has per-thread malloc caches that can be disabled via configuration options.

    I cannot figure out the RSS metric. For jemalloc 3.0, did RSS grow by ~200 kB or 200,000 kB? My confusion might be from US versus non-US number style.

  5. Dimitri says:

    Many thanks for sharing it! Very useful stuff!

    Regarding Mark’s request to graph RSS during the test — you already know how to do it easily with dim_STAT ;-)

    Rgds,
    -Dimitri

  6. Quite interesting.

    1. Since we are comparing malloc performance, it is quite possible that something else besides malloc worsened glibc’s performance across versions (glibc is not just a malloc, whereas the others are just mallocs), though in the case of glibc 2.13 the picture is clear because you tested the same version with the new malloc turned off and on.

    1.1 On a related note, maybe we can have a comparison of the corresponding libcs – musl, uClibc, dietlibc and glibc (they each seem to have their own malloc); that way we can also compare the whole stack – pthread, malloc, etc.

    2. Regarding the RSS size, I would be careful before judging, because -O3 generally causes the compiler to be more lenient with regard to space.

    2.1 Also, I noticed in “I built all allocators from sources (except system glibc 2.12.1) with stock CentOS gcc (version 4.4.6 20110731). All of them were built with -O3” – so glibc 2.12.1 seems to have been built with -O2 (the default for distro builds) and the others with -O3, is that right?

    3. Another important measure is when these allocators switch from sbrk to mmap. For glibc, according to the documentation, it should be MMAP_THRESHOLD, which is 128 kB, though in later glibcs the boundary may have become fuzzy; for the other allocators this would need to be checked.

    4. Regarding tcmalloc, it is known for being more memory hungry, so numbers are not surprising. (I guess the ‘hunger’ is not a bug since for google environment it works better).

    5. Regarding the ‘new’ glibc malloc, the details are quite scary if http://udrepper.livejournal.com/20948.html is anything to go by. The migration from per-core pools to per-thread pools may have been bad: since the cap is 8 arenas per core on a 64-bit system, a 24-core box can end up with as many as 192 separate pools.
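For concreteness (assuming the documented 64-bit default cap of 8 arenas per core), the arithmetic for the 24-core box discussed here works out to:

```shell
# glibc's new malloc caps arenas at 8 * cores on 64-bit (2 * cores on 32-bit)
cores=24
echo $((8 * cores))   # → 192
```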

  7. Bradley,

    Yes, that is CentOS 6.2 with huge pages enabled. I’ve run several additional tests (see my last charts) and can confirm that with huge pages enabled the memory footprint is notably larger.

  8. Hander,

    Sorry, I don’t have a Windows box for testing. It seems there are several malloc replacements around for the Win platform (like nedmalloc) and indeed it would be nice to try something similar there.

  9. Mark,

    >At the risk of asking for too much work from you, I agree with gebi that it would be good to see RSS during the test. jemalloc also has per-thread malloc caches that can be disabled via configuration options.

    Done. Check my latest charts for jemalloc-2.2.5/3.0 and tcmalloc.

    re: RSS metric – yes, I agree it is not really clear – that is 200,000 kB = ~200 MB.

  10. Dimitri,

    You are welcome!
    Sure, I know. Btw, when will the next improved version be released? :)

  11. Regarding the transparent huge pages, this is related to http://lwn.net/Articles/467332/ and http://lwn.net/Articles/467328/ . The defrag/compaction of THP seems to be the root cause (basically it makes the ‘reclaim’ synchronous). It is fixed in Linux 3.3 and above, though CentOS/RHEL kernels may have backported the fix.
    Interestingly, from http://bugs.centos.org/view.php?id=5716 it looks like only CentOS has it enabled whereas upstream RHEL does not.
