August 20, 2014

InnoDB memory allocation, ulimit, and OpenSUSE

I recently encountered an interesting case. A customer reported that mysqld crashed on start on OpenSUSE 11.2, kernel 2.6.31.12-0.2-desktop x86_64, with 96 GB RAM whenever innodb_buffer_pool_size was set to anything more than 62 GB. I decided to try it with 76 GB. The error was an assert caused by a failed malloc() in ut_malloc_low() in ut/ut0mem.c in the InnoDB source code. InnoDB wraps the majority of its memory allocations in ut_malloc_low(), so to get an idea of the pattern of requested allocations I added a debugging fprintf() to report how much was being allocated and whether the allocation succeeded.
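
The printout amounted to one line next to the malloc() call in ut_malloc_low(); a sketch of the idea (not the exact patch, and the ut_mem_block_t header is from memory of how that function does its bookkeeping in 5.0):

    /* ut/ut0mem.c, inside ut_malloc_low(), right after the allocation */
    ret = malloc(n + sizeof(ut_mem_block_t));

    fprintf(stderr, "InnoDB: ut_malloc_low(%lu bytes): %s\n",
            (ulong) n, ret ? "ok" : "FAILED");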

I discovered something interesting. I expected the allocation to fail on the 76 GB of the buffer pool itself, due to some weird memory-mapping issue leaving no contiguous block of 76 GB available. However, that is not what happened. The 76 GB buffer was allocated successfully. What failed was an allocation of 3.37 GB that came after it. What in the world could InnoDB need that was 3.37 GB? Nothing in the settings explicitly asked for anything close to 3 GB.

Source code is the ultimate documentation, and I took advantage of that. My good friend GDB guided me to buf_pool_init() in buf/buf0buf.c. There I found the following:
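
A sketch from memory of the 5.0 source (the exact form varies between versions):

    buf_pool->frame_mem = os_mem_alloc_large(
            UNIV_PAGE_SIZE * (n_frames + 1), TRUE, FALSE);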

That was the buffer pool itself, the 76 GB of it. And now the buffer pool’s friend:
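
Again from memory of the same function (the sizeof(buf_block_t) multiplier is the key point):

    buf_pool->blocks = ut_malloc(sizeof(buf_block_t) * max_size);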

3.6 GB of it!

From the comments in the code (the InnoDB code actually has very good comments): max_size is the maximum number of buffer pool pages (16K each), and n_frames is the same thing unless AWE is used; it was not, so I did not worry about it.

What shall we call that friend? It is used for storing some meta information about buffer pool pages. The most natural name I could come up with from reading the source code is the blocks array.

Thus we can see that InnoDB allocates another chunk, proportional to the setting of innodb_buffer_pool_size, for the blocks array. The exact proportion will probably vary from version to version, but it is roughly 1 GB for every 25 GB of buffer pool. This becomes significant when estimating the proper innodb_buffer_pool_size on a system with a lot of RAM where you want the largest possible buffer pool. Do not forget to give the blocks array some room!
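
As a sanity check on that ratio, a quick back-of-the-envelope computation (the ~700 bytes of per-page metadata is an estimate inferred from the failed 3.37 GB allocation, not a number taken from the source):

    #include <stdio.h>

    int main(void)
    {
        const double pool_bytes = 76.0 * 1024 * 1024 * 1024;
        const double page_size  = 16 * 1024;  /* 16K pages */
        const double block_meta = 700;        /* assumed sizeof(buf_block_t) */
        double n_pages = pool_bytes / page_size;

        /* ~4.98M pages, ~3.25 GB of metadata: the same ballpark as
           the failed 3.37 GB allocation */
        printf("%.0f pages, %.2f GB for the blocks array\n",
               n_pages, n_pages * block_meta / (1024.0 * 1024 * 1024));
        return 0;
    }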

While this was an interesting investigation, it nevertheless did not explain why there was not enough room for a 76 GB buffer pool. Even with the extra 3.37 GB allocation, there was still some free memory. Or was there? Maybe some hidden monster was eating it up? I quickly wrote this hack to prove or disprove the monster’s presence.
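
The hack boiled down to the following idea (a reconstruction, not the exact original code): take a size in GB on the command line, malloc() it, touch every byte, and hold on to the memory:

    /* alloc_test.c: allocate N GB, touch every page, then wait */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(int argc, char** argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s n_gb\n", argv[0]);
            return 1;
        }
        size_t sz = (size_t) atoll(argv[1]) << 30;
        char* p = malloc(sz);
        if (p == NULL) {
            perror("malloc");
            return 1;
        }
        printf("malloc() succeeded, initializing...\n");
        memset(p, 0x5a, sz);  /* force the kernel to really hand over the pages */
        printf("initialized, holding on to the memory\n");
        pause();
        return 0;
    }

Running it as ./alloc_test 40 in two shells at once, versus ./alloc_test 80 in one, gives the comparison described below.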

I verified that I could allocate and initialize two chunks of 40 GB from two separate processes, but not 80 GB from one. In fact, the 80 GB allocation failed right in malloc(); it did not even get to initialization. I also tested allocating 70 GB concurrently in each process so as to overrun physical memory + swap. Both allocations succeeded; one initialized successfully, the other was killed by the OOM killer during initialization.

This smelled like a low ulimit, and sure enough it was. ulimit -m unlimited; ulimit -v unlimited did the magic, and mysqld successfully started with an 80 GB buffer pool. Apparently the OpenSUSE defaults are set in proportion to physical memory to keep memory-hungry applications from taking the system down. On this particular system (96 GB physical memory, 2 GB swap), it decided to set the virtual memory ulimit (-v) to 77.27 GB and the physical memory one (-m) to 80.40 GB.
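
For the record, the limits in question can also be read programmatically; a minimal sketch using getrlimit() (RLIMIT_AS corresponds to ulimit -v, RLIMIT_RSS to ulimit -m):

    #include <stdio.h>
    #include <sys/resource.h>

    static void show(const char* name, int resource)
    {
        struct rlimit rl;
        if (getrlimit(resource, &rl) != 0)
            return;
        if (rl.rlim_cur == RLIM_INFINITY)
            printf("%s: unlimited\n", name);
        else
            printf("%s: %.2f GB\n", name,
                   (double) rl.rlim_cur / (1024.0 * 1024 * 1024));
    }

    int main(void)
    {
        show("virtual memory (ulimit -v)", RLIMIT_AS);
        show("max RSS        (ulimit -m)", RLIMIT_RSS);
        return 0;
    }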

About Sasha Pachev

Sasha is a former Percona employee.
Sasha formerly worked at MySQL, where he was the original creator of MySQL's native replication. He is the author of two books: MySQL Enterprise Solutions and Understanding MySQL Internals.

Comments

  1. Mark says:

    The buf_pool->blocks overhead per page is independent of the page size. For 8 KB pages InnoDB uses about 1 byte in buf_pool->blocks for every 12.5 bytes in the buffer pool. For 4 KB pages it is about 1 byte for every 6.25 bytes. We need to fix that overhead to make smaller page sizes efficient. Much of it comes from the synchronization objects (the rw-lock and mutex), which use too much memory.

  2. Dimitri says:

    It’s still unclear why there was a problem allocating a 63 GB buffer pool if the initial system limit was 77 GB...

    Also, it would be nice to mention the MySQL/InnoDB version :-)

    Rgds,
    -Dimitri

  3. peter says:

    Dimitri,

    I think Sasha did not quite explain what the problem was… it was the ulimit settings. Honestly, we have not investigated whether it is this OS version that comes with ulimits set to a portion of memory by default, or whether it was the customer’s configuration. It was just so unusual, as we almost never see a ulimit set for memory or virtual memory.

  4. Alfa says:

    I want to use memory allocation for MySQL optimizations. Is there an InnoDB memory allocation guide?

  5. Sasha says:

    Mark:

    Thanks for the clarifications. Good to know.

    Dimitri:

    This was the Percona build of MySQL 5.0.91. However, the issue of needing significantly more memory than the buffer pool size is not unique to this version.

    I have not researched in detail why, with a 77 GB VM ulimit, the buffer pool could only be 62 GB. My goal was to make it possible to use 76 GB on that system, which we achieved. I did not play the game of pushing the buffer pool to the highest possible value that would still allow mysqld to start without disabling InnoDB or plain crashing, because that is a dangerous game. malloc() can return successfully, but that does not mean all is well. Something like this can happen, for example: the buffer pool is allocated and partially touched. Then over a period of time something else is allocated and touched, and we are now out of physical memory. Then the buffer pool is touched some more, and we start swapping. Or, if we are out of swap space by then as well, mysqld gets killed, either by the OOM killer in the kernel or via a segfault while trying to grab another page from the kernel. Yes, that’s right: when the system starts hurting for memory, it is possible to segfault while writing to a successfully allocated block that malloc() returned!

    However, if we tried to account for where the memory went: the blocks array would add about 3 GB, so now you have only 12 GB left before you hit the limit. Some memory for the binary itself and the shared libraries. Some for the thread stacks. Some for other internal needs of MySQL. Some waste from memory-space fragmentation. It all adds up eventually. The bottom line is that if you set a ulimit on MySQL memory usage, you are likely to be in for a surprise.

    An interesting experiment: on a test server (do not try this in production), set a ulimit that is barely big enough for mysqld to start properly, then run it under load with multiple clients. More often than not you will not see a memory allocation failure; it will be death from signal 11 (SIGSEGV).

  6. Anonymous says:

    > Yes, that’s right, when the system starts hurting for memory, it is possible to segfault while writing to a successfully allocated block that malloc() returned!

    You should read about overcommit.

    > Then run it under load with multiple clients. More often than not you will not see a memory allocation failure, it will be death from signal 11 (SIGSEGV).

    Apart from some bugs, you know, stack space doesn’t come out of thin air.
