April 23, 2014

kernel_mutex problem. Or double throughput with single variable

The problem with kernel_mutex in MySQL 5.1 and MySQL 5.5 is known: Bug report. In fact, MySQL 5.6 has some fixes that are supposed to provide a solution, but MySQL 5.6 still has a long way to go before production, and it is also not clear whether the problem is really fixed.

Meanwhile the kernel_mutex problem is showing up more and more often: during the last month I had three customer cases related to performance drops.

So what can be done about it? Let’s run some benchmarks.

But first some theory. InnoDB takes kernel_mutex when it starts and stops transactions, and when InnoDB starts a transaction it usually loops through ALL active transactions, and this loop runs while holding kernel_mutex. So to see kernel_mutex in action, we need many concurrent but short transactions.
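To make the theory concrete, here is a minimal sketch of that pattern: one global lock, and a walk over every active transaction while the lock is held. It is a toy model under stated assumptions, not InnoDB code. All names (toy_kernel_mutex, toy_trx_start, toy_trx_commit and so on) are hypothetical, and the work done per list entry is just a stand-in.

```c
#include <pthread.h>
#include <stddef.h>

/* Toy model with hypothetical names; NOT InnoDB source code. */
#define TOY_MAX_TRX 1024

static pthread_mutex_t toy_kernel_mutex = PTHREAD_MUTEX_INITIALIZER;
static unsigned long   toy_active[TOY_MAX_TRX];  /* ids of active transactions */
static size_t          toy_n_active = 0;
static unsigned long   toy_next_id = 1;
static unsigned long   toy_oldest_active = 0;    /* result of the last walk */

/* Start a transaction. The walk over the active list happens while the
   global mutex is held, so the time spent under the lock grows with the
   number of active transactions. Returns the new id (0 if the table is
   full) and stores the number of list entries visited in *visited. */
unsigned long toy_trx_start(size_t *visited)
{
    unsigned long id = 0;
    size_t i;

    pthread_mutex_lock(&toy_kernel_mutex);
    toy_oldest_active = toy_next_id;
    for (i = 0; i < toy_n_active; i++) {
        /* Stand-in for per-transaction work: track the oldest id. */
        if (toy_active[i] < toy_oldest_active)
            toy_oldest_active = toy_active[i];
    }
    *visited = i;
    if (toy_n_active < TOY_MAX_TRX) {
        id = toy_next_id++;
        toy_active[toy_n_active++] = id;
    }
    pthread_mutex_unlock(&toy_kernel_mutex);
    return id;
}

/* Finish a transaction: remove it from the list under the same lock. */
void toy_trx_commit(unsigned long id)
{
    pthread_mutex_lock(&toy_kernel_mutex);
    for (size_t i = 0; i < toy_n_active; i++) {
        if (toy_active[i] == id) {
            toy_active[i] = toy_active[--toy_n_active];
            break;
        }
    }
    pthread_mutex_unlock(&toy_kernel_mutex);
}
```

With many short transactions, every start and every commit goes through the same lock, and each start holds it longer as the active list grows. That is the kind of load the benchmark is meant to produce.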

For this we will take sysbench running only simple select PK queries against 48 tables, 5,000,000 rows each.

The hardware is a Cisco UCS C250 server. The workload is read-only and fully in memory.

Here are the results for different thread counts (against Percona Server 5.5.17):

Threads    Throughput, q/s
1          11178.34
2          27741.06
4          53364.52
8          92546.73
16         144619.58
32         164884.03
64         154235.73
128        147456.33
256        68369.02
512        40509.67
1024       22166.94

The peak throughput is 164884 q/s at 32 threads, and it declines to 68369 q/s at 256 threads, which is a 2.4x drop.

The reason, as you may guess, is kernel_mutex. How can you see it? It is easy. In SHOW ENGINE INNODB STATUS\G you will see a lot of lines like:

This problem is actually quite serious. In real workloads I have seen it happen with fewer than 256 threads, and not all production systems can tolerate a 2x drop in throughput at peak times.

So what can be done about it?

As a first try, let’s recall that kernel_mutex (like all InnoDB mutexes) has complex handling with spin loops, and there are two variables that affect these loops: innodb_sync_spin_loops and innodb_spin_wait_delay. I actually think that tuning a system with these variables is closer to dancing with a drum than to a scientific method, but if nothing else helps, why not try.

Here we vary innodb_sync_spin_loops from 0 to 100 (the default is 30):
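To show what the two knobs control, here is a minimal sketch of a spin-then-sleep mutex. It is a toy under stated assumptions, not InnoDB's implementation. The names toy_mutex, toy_delay and toy_mutex_stress are hypothetical, the delay loop is arbitrary, and the fields spin_loops and spin_wait_delay only play the roles of innodb_sync_spin_loops and innodb_spin_wait_delay. A thread first spins on the lock word, pausing a random amount between attempts, and only when spinning fails does it go to sleep on a condition variable.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Toy spin-then-sleep mutex. Hypothetical names; NOT InnoDB source code. */
typedef struct {
    atomic_int      lock_word;        /* 0 = free, 1 = held */
    pthread_mutex_t event_mutex;      /* guards the sleep/wake handshake */
    pthread_cond_t  event_cond;
    unsigned        spin_loops;       /* role of innodb_sync_spin_loops */
    unsigned        spin_wait_delay;  /* role of innodb_spin_wait_delay */
} toy_mutex;

void toy_mutex_init(toy_mutex *m, unsigned spin_loops, unsigned spin_wait_delay)
{
    atomic_init(&m->lock_word, 0);
    pthread_mutex_init(&m->event_mutex, NULL);
    pthread_cond_init(&m->event_cond, NULL);
    m->spin_loops = spin_loops;
    m->spin_wait_delay = spin_wait_delay;
}

/* Burn a few cycles without touching the lock word. */
static void toy_delay(unsigned units)
{
    for (volatile unsigned i = 0; i < units * 50u; i++) { }
}

void toy_mutex_lock(toy_mutex *m)
{
    unsigned rnd = 12345u;  /* state of a tiny pseudo-random generator */
    for (;;) {
        /* Phase 1: spin; at least one attempt even when spin_loops is 0. */
        for (unsigned i = 0; i <= m->spin_loops; i++) {
            if (atomic_load(&m->lock_word) == 0 &&
                atomic_exchange(&m->lock_word, 1) == 0)
                return;  /* acquired */
            rnd = rnd * 1103515245u + 12345u;
            toy_delay((rnd >> 16) % (m->spin_wait_delay + 1u));
        }
        /* Phase 2: spinning failed, sleep until a holder releases. */
        pthread_mutex_lock(&m->event_mutex);
        if (atomic_load(&m->lock_word) != 0)
            pthread_cond_wait(&m->event_cond, &m->event_mutex);
        pthread_mutex_unlock(&m->event_mutex);
        /* Woken up (possibly spuriously): go back and spin again. */
    }
}

void toy_mutex_unlock(toy_mutex *m)
{
    atomic_store(&m->lock_word, 0);
    pthread_mutex_lock(&m->event_mutex);
    pthread_cond_broadcast(&m->event_cond);  /* signals every sleeper */
    pthread_mutex_unlock(&m->event_mutex);
}

typedef struct { toy_mutex *m; long *counter; int iters; } toy_job;

static void *toy_worker(void *arg)
{
    toy_job *job = arg;
    for (int i = 0; i < job->iters; i++) {
        toy_mutex_lock(job->m);
        (*job->counter)++;  /* protected by the toy mutex */
        toy_mutex_unlock(job->m);
    }
    return NULL;
}

/* Run up to 64 workers that each increment a shared counter iters times.
   Returns the final counter value. */
long toy_mutex_stress(int n_threads, int iters, unsigned spin_loops)
{
    toy_mutex m;
    long counter = 0;
    pthread_t tid[64];
    toy_job job = { &m, &counter, iters };

    if (n_threads > 64) n_threads = 64;
    toy_mutex_init(&m, spin_loops, 6);
    for (int i = 0; i < n_threads; i++)
        pthread_create(&tid[i], NULL, toy_worker, &job);
    for (int i = 0; i < n_threads; i++)
        pthread_join(tid[i], NULL);
    pthread_mutex_destroy(&m.event_mutex);
    pthread_cond_destroy(&m.event_cond);
    return counter;
}
```

A call such as toy_mutex_stress(8, 2000, 200) should return 16000 if mutual exclusion holds. In this toy, a larger spin_loops value means a thread tries longer before it falls through to the sleeping phase. It says nothing about what the right value is on a real server.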

Threads    Throughput    NA
1          11178.34
2          27741.06
4          53364.52
8          92546.73
16         144619.58
32         164884.03
64         154235.73
128        147456.33
256        68369.02
512        40509.67
1024       22166.94

I was surprised to see that with innodb_sync_spin_loops=100 throughput improves to 145324 q/s, almost the peak throughput from the first experiment.

With innodb_sync_spin_loops=100 the kernel_mutex is still the main point of contention, but InnoDB tries to prevent the current thread from pausing, and that seems to help.

Further experiments showed that 100 is not enough for 512 threads, and it should be increased to 200.

So here are the final results with innodb_sync_spin_loops=200 for 1-1024 threads.

Threads    Throughput, q/s (default)    Throughput, q/s (spin 200)
1          11178.34                     11288.42
2          27741.06                     28387.62
4          53364.52                     53575.52
8          92546.73                     92184.65
16         144619.58                    143688.91
32         164884.03                    164392.94
64         154235.73                    154022.57
128        147456.33                    152280.84
256        68369.02                     150089.31
512        40509.67                     127680.65
1024       22166.94                     61507.08

So by playing with this variable we can double throughput at high thread counts, back to the level we see with 32-64 threads.
I cannot really explain how it works internally, but I wanted to show one possible way
to deal with the problem when you are hit by kernel_mutex contention.
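For readers who want the gain per thread count rather than the raw numbers, the ratios can be computed directly from the two result columns above. This is just a small helper sketch. The function names spin200_gain and drop_from_peak are made up for illustration, and the data is copied from the tables in this post.

```c
#include <stddef.h>

#define N_RUNS 11

/* Benchmark results copied from the tables in this post (q/s). */
static const int    run_threads[N_RUNS] =
    { 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 };
static const double qps_default[N_RUNS] =
    { 11178.34, 27741.06, 53364.52, 92546.73, 144619.58, 164884.03,
      154235.73, 147456.33, 68369.02, 40509.67, 22166.94 };
static const double qps_spin200[N_RUNS] =
    { 11288.42, 28387.62, 53575.52, 92184.65, 143688.91, 164392.94,
      154022.57, 152280.84, 150089.31, 127680.65, 61507.08 };

/* Index of a measured thread count, or -1 if it was not measured. */
static int run_index(int n_threads)
{
    for (int i = 0; i < N_RUNS; i++)
        if (run_threads[i] == n_threads)
            return i;
    return -1;
}

/* Throughput with innodb_sync_spin_loops=200 divided by the default
   throughput at the same thread count; -1.0 if not measured. */
double spin200_gain(int n_threads)
{
    int i = run_index(n_threads);
    return i < 0 ? -1.0 : qps_spin200[i] / qps_default[i];
}

/* Peak default throughput divided by the default throughput at the
   given thread count; -1.0 if not measured. */
double drop_from_peak(int n_threads)
{
    int i = run_index(n_threads);
    double peak = 0.0;
    for (int k = 0; k < N_RUNS; k++)
        if (qps_default[k] > peak)
            peak = qps_default[k];
    return i < 0 ? -1.0 : peak / qps_default[i];
}
```

For these numbers, spin200_gain comes out at roughly 2.2 for 256 threads, 3.15 for 512 and 2.8 for 1024, while staying close to 1.0 up to 128 threads.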

As a next step I want to try limiting innodb_thread_concurrency and binding mysqld to fewer CPUs. It will also be interesting to see whether MySQL 5.6.3 really fixes this problem.


About Vadim Tkachenko

Vadim leads Percona's development group, which produces the Percona Server and Percona XtraBackup. He is an expert in solid-state storage, and has helped many hardware and software providers succeed in the MySQL market.

Comments

  1. For those who are curious, Vadim’s work here was partially in response to the mysterious kernel_mutex problem I had with a customer that turned out to be GDB-related: http://www.mysqlperformanceblog.com/2011/12/02/three-ways-that-the-poor-mans-profiler-can-hurt-mysql/

    I am also not sure why raising the variable helped. On our phone call, Vadim and I discussed the variable and I guessed that lowering it would help and raising it would make it worse, because I thought that spinning was the problem :-) oprofile reports showed that ut_delay was consuming the vast majority of CPU time, and I thought that getting rid of the wasted work might potentially help. Wrong…

  2. So looks like in 5.6 the kernel mutex may have been finally let off — http://blogs.innodb.com/wp/?p=734 (They don’t seem to date their posts .. weird)

  3. Baron,

    I believe I know the reason why innodb_sync_spin_loops helps there.

    The problem is old, and for some reason I was sure it had been fixed already.

    The problem is that InnoDB uses its own mutex implementation, which internally uses condition variables.
    And the current implementation uses pthread_cond_broadcast to wake up threads.
    That means that all (hundreds or thousands of) threads waiting on the mutex wake up together at the same moment
    and try to compete for the mutex again.

    Increasing innodb_sync_spin_loops delays the point at which a thread falls back to condition variables, and lets the
    mutex be resolved via the spin loop alone.

    In this case using innodb_thread_concurrency also should help, and I am running experiments with it right now.

  4. Raghavendra,

    Removing kernel_mutex does not automatically fix the problem, as you will run into another mutex after that.
    So I would wait for the results before saying that the problem is fixed.

  5. That makes sense. I think that Mark Callaghan has mentioned this problem recently too.

  6. Davi Arnaut says:

    > And current implementation uses pthread_cond_broadcast to wake up threads.
    > That means that all ( hundreds or thousands) threads, waiting on mutex, wake
    > up all together at the same moment and trying to compete for mutex again.

    pthread_cond_broadcast just requeues (FUTEX_REQUEUE) into the mutex wait list.
    Perhaps the thundering herd you mention is at some other level?

  7. Davi,

    I am not sure what you are referring to with FUTEX_REQUEUE, but you caught my curiosity, so I overcame my laziness and went to
    1. http://pubs.opengroup.org/onlinepubs/009604499/functions/pthread_cond_signal.html
    it says:
    “The pthread_cond_broadcast() function shall unblock all threads currently blocked on the specified condition variable cond.”

    2. Since I am used to documentation being wrong, I wrote a test, cond.c (actually taken from
    http://waxway.blogspot.com/2011/07/awake-all-threads-pthreadcondbroadcast.html)

    with the following change:

    pthread_mutex_lock(&cond_mutex);
    pthread_cond_wait(&cond, &cond_mutex);
    printf("T WOKE: %x\n", (unsigned) pthread_self());
    pthread_mutex_unlock(&cond_mutex);

    and on a single "pthread_cond_broadcast" it prints:
    T WOKE: bd143700
    T WOKE: bc742700
    T WOKE: bb340700
    T WOKE: ba93f700
    T WOKE: bbd41700

    That is all 5 threads woke up.

  8. Davi Arnaut says:

    The point is that they are not all woken up at the same time/moment. When a condition is broadcasted, the threads waiting on the condition are just moved to the wait list of the mutex, where they are woken one by one.

  9. Davi,

    If I do the following:
    pthread_mutex_lock(&cond_mutex);
    pthread_cond_wait(&cond, &cond_mutex);
    printf("T WOKE: %x\n", (unsigned) pthread_self());
    pthread_mutex_unlock(&cond_mutex);
    printf("T WOKE 2: %x\n", (unsigned) pthread_self());

    I get:
    T WOKE: 91339700
    T WOKE 2: 91339700
    T WOKE: 90938700
    T WOKE 2: 90938700
    T WOKE: 8f536700
    T WOKE 2: 8f536700
    T WOKE: 91d3a700
    T WOKE 2: 91d3a700
    T WOKE: 8ff37700
    T WOKE 2: 8ff37700

    on a single pthread_cond_broadcast.

    This is what I refer to when I say that ALL threads wake.

    In the InnoDB implementation, after a thread wakes it goes back to the SPIN LOOP.

    Simplifying, InnoDB mutex looks like:

    That is, all threads arrive at pthread_cond_wait in random order, but
    once the mutex is released, they all WAKE UP and start the loop again.

  10. Mark Callaghan says:

    They are all scheduled to run so they are all going to run. Then they will busy-wait for 20 microseconds or more in the InnoDB mutex code and then a bit more in pthread code courtesy of PTHREAD_MUTEX_ADAPTIVE_NP. Then they will go back to sleep. When there are hundreds of them they will delay productive threads from being scheduled. They will also get cache lines in read-mode so that productive threads have to do cross-socket cache operations which leads to more latency. This is very inefficient.

  11. Davi Arnaut says:

    > This is what I refer to when I say that ALL threads wake.

    Yes, eventually they will all wake up because they are waiting on the mutex. One thread will grab the mutex, and once it releases it, another thread is woken up.

    What I was replying to is:

    > wake up all together at the same moment

    Which is not true for pthread_cond_broadcast. Again, if there are threads sleeping on the condition variable, they are re-queued into waiting on the mutex. If the mutex is unlocked, only the top waiter is woken. Only one thread may lock a mutex, so there is simply no point in waking all threads.

    References:

    1. http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf See introduction.
    2. http://repo.or.cz/w/glibc.git/blob/HEAD:/nptl/pthread_cond_broadcast.c

  12. Davi Arnaut says:

    > In InnoDB implementation after thread wakes it comes back to SPIN LOOP

    Yes, but one important point: InnoDB only uses pthread synchronization objects to implement the wait queue of an InnoDB mutex. When the threads are on the wait queue, only one will actually be woken, and this one grabs the _wait queue_ lock. Soon after being woken up, the thread releases the wait queue lock, which wakes up another, and so on. Outside of the wait queue, what you said applies.

  13. Setting innodb_sync_spin_loops makes for a very interesting discussion because there is really no “right” answer: depending on which mutex limits your workload, a different amount of spinning might make sense. A better solution would be to have this variable set per mutex and adjusted automatically.

    I believe it would be possible to design a system that profiles how long it takes to grab the mutex, say by profiling one out of every 1000 mutex get requests. Then, based on the distribution, we can decide how long it makes sense to wait. For example, if we run a long spin and discover that we either get the mutex after 10us or spin until the end of the 1000us limit, we can decide to spin for up to 20us or so, which will handle short locks of the given mutex, and otherwise switch to an OS wait and stop wasting CPU.

  14. James Day says:

    You might want to look at innodb-adaptive-max-sleep-delay in MySQL 5.6. It makes innodb-thread-sleep-delay adaptive and is of particular value over 1024 threads in 5.6.

    Sunny’s OOW presentation at https://oracleus.wingateweb.com/published/oracleus2011/sessions/20020/20020_Cho2577660.pdf mentions it on slide 28.

    James Day, Oracle. This is my view only; for an official Oracle opinion consult a PR person.

  15. Andy Carlson says:

    Vadim,

    I want to thank you for this informative post. I had a workload that I was working with a few years ago, that I could not get to perform well in innodb. It seemed like MySQL would attack one thread, and starve all the rest. I dug out the old code and data, and ran it with innodb_sync_spin_loops=64, and the workload performed much better.

    Thanks again, and I will be watching for more posts from you in the future.

  16. yangdehua says:

    We also saw this problem.

    What’s more, the 5.1 manual and the 5.5 manual differ:

    In 5.1 the default value of innodb_thread_concurrency is 8: http://dev.mysql.com/doc/refman/5.1/en/innodb-parameters.html#sysvar_innodb_thread_concurrency

    In 5.5 the default value is 0: http://dev.mysql.com/doc/refman/5.5/en/innodb-parameters.html#sysvar_innodb_thread_concurrency

    If we set innodb_thread_concurrency, the server’s load goes down.
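The pthread_cond_broadcast experiment discussed in the comments above can be reproduced with a small self-contained program. This is a sketch under stated assumptions, not the cond.c from the thread: the names broadcast_demo and demo_waiter are hypothetical, and the waiters use a predicate flag so that a spurious wakeup cannot distort the count. It counts how many waiting threads get past pthread_cond_wait after one broadcast.

```c
#include <pthread.h>
#include <sched.h>

#define DEMO_MAX_THREADS 64

static pthread_mutex_t demo_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  demo_cond  = PTHREAD_COND_INITIALIZER;
static int demo_waiting;  /* threads that have reached the wait */
static int demo_go;       /* predicate set by the broadcaster */
static int demo_woken;    /* threads that returned from the wait */

static void *demo_waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_mutex);
    demo_waiting++;
    while (!demo_go)
        pthread_cond_wait(&demo_cond, &demo_mutex);
    /* pthread_cond_wait returns with demo_mutex held, so the waiters
       pass this point one at a time. */
    demo_woken++;
    pthread_mutex_unlock(&demo_mutex);
    return NULL;
}

/* Start n waiters, issue ONE broadcast once all of them are waiting,
   and return how many of them woke up. */
int broadcast_demo(int n)
{
    pthread_t tid[DEMO_MAX_THREADS];

    if (n <= 0) return 0;
    if (n > DEMO_MAX_THREADS) n = DEMO_MAX_THREADS;
    demo_waiting = demo_go = demo_woken = 0;

    for (int i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, demo_waiter, NULL);

    /* Wait until every thread is blocked inside pthread_cond_wait.
       The loop exits with demo_mutex still held. */
    for (;;) {
        pthread_mutex_lock(&demo_mutex);
        if (demo_waiting == n)
            break;
        pthread_mutex_unlock(&demo_mutex);
        sched_yield();
    }
    demo_go = 1;
    pthread_cond_broadcast(&demo_cond);  /* the single broadcast */
    pthread_mutex_unlock(&demo_mutex);

    for (int i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    return demo_woken;
}
```

Calling broadcast_demo(5) returns 5: one broadcast unblocks every waiter, as the POSIX text quoted in the thread says. At the same time, each waiter has to reacquire the mutex before pthread_cond_wait returns, so they leave the wait one after another rather than all at once. The program does not show how the kernel queues them internally.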
