April 18, 2014

Scaling problems still exist in MySQL 5.5 and Percona Server 5.5

MySQL 5.5 and Percona Server 5.5 do not solve all scalability problems, even for read-only workloads. Workloads which have received a lot of attention, such as Sysbench and DBT2/TPC-C, scale pretty well, but there are other quite typical workloads which do not scale nearly as well. This is why it is important to test performance and scalability for your application on your own hardware if you want results that are most relevant to your situation.

In this example I spotted the query pattern responsible for a large portion of the load in the application by aggregating the query log with mk-query-digest. I then filtered out only the queries of this pattern, which gave me a simplified slice of the production workload that focuses on a single query yet still mimics the real-world distribution of values.

The query looks something like this:
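
Sketching the shape of it (the table name and literal values below are placeholders, not taken from the actual workload), it is a medium-size IN list combined with an equality filter on a deleted flag:

SELECT *
FROM objects
WHERE group_id IN (23, 41, 45, 52, 63, 78, 94, 101, 112, 123, 130)
  AND deleted = 0;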

The table has a secondary key defined on (group_id, deleted).

This pattern of medium-size IN lists is typical for a lot of modern applications. The system I used for the test was a 4-socket R910 with X7542 CPUs, giving 24 real cores. The workload was completely in memory for this test. I compared Percona Server 5.1.56-12.7 and Percona Server 5.5.11-20.2, with the latter configured with innodb_buffer_pool_instances=16 and a 400GB buffer pool. Here are the results:

First, it is worth saying that we see great improvements for this workload in Percona Server 5.5, showing up to 2.5 times better performance (a 150% gain), which is a great headline number. If we look at the details, however, we can see this is an improvement from a complete disaster to something that is still unacceptable. If we look at the gain the system shows from single-thread performance to peak performance, the number is 2.3x for Percona Server 5.1 and around 6x for Percona Server 5.5, which is far too little for what you would expect from a 24-core system. For comparison, you can check the Sysbench results which Vadim published: for a read-only workload we get a 17x gain from 1 thread to peak performance on a system with 12 physical cores and 24 threads, which is almost perfect scalability.

So there are workloads where the problem is solved, and there are others where a lot of work still needs to be done and where you still cannot effectively use more than 8 cores (which would correspond to a single CPU socket these days).

Here are some more details. oprofile:

PMP

P.S. Disabling the adaptive hash index makes things a lot worse for this workload, and innodb_adaptive_hash_index_partitions=16, which is available for partitioning the adaptive hash index, does not help because everything hits the same index.
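
For reference, these settings can be checked from SQL, and the adaptive hash index itself toggled at runtime (innodb_adaptive_hash_index_partitions is a Percona Server option and can only be set at server startup):

-- Show the current adaptive hash index settings
SHOW GLOBAL VARIABLES LIKE 'innodb_adaptive_hash_index%';

-- The AHI itself can be switched on or off without a restart
SET GLOBAL innodb_adaptive_hash_index = OFF;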

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Patrick Casey says:

    The way database caches work (in the abstract) is that they keep the most recently used data blocks in memory so they do not have to be read from disk. The actual mechanics of this vary between storage engines, but the concept is the same.

    If a given block hasn't been used recently, and other, more recently accessed, blocks need to be cached, then the older block gets discarded. It's still on disk as well, so it's not lost, but it's no longer in memory.

    The bog-standard “solution” to the problem of keeping important stuff in cache is to get enough memory on your server and allocate it to the database such that your commonly accessed blocks (your working set) fit in memory.

    Not knowing the particulars of your hardware/configuration, I can't tell if you flat out need more resources or if it's potentially a configuration issue on the database server (it's entirely possible, for example, to configure InnoDB to use 512M for the buffer pool on a 32G box; not a good idea, mind you, but possible).
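
    As a rough first check (standard MySQL commands), you can see how large the buffer pool actually is and whether reads are missing it:

    -- Configured buffer pool size, in bytes
    SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';

    -- A high ratio of Innodb_buffer_pool_reads (misses that go to disk) to
    -- Innodb_buffer_pool_read_requests suggests the working set does not fit in memory
    SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%';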

  2. Vlad Fridman says:

    Patrick,

    Thank you for your response! In my case it is what you described in point 2. The first one is slow and the subsequent ones are fast. However, if for 1-2 minutes there is no activity in the database, then the next query is slow again. I guess the cache gets invalidated after 1-2 minutes.

    Is there any way to force the data to stay in cache for a longer time?

    Vlad

  3. Patrick Casey says:

    Vlad, did you run an explain plan?

    Assuming that id is an indexed column and the optimizer is using the index, the only plausible way that would take 0.5 seconds with modern hardware would be if the index blocks were on disk and the disk was relatively busy.

    Without knowing more about your setup anything more would be speculation, but when you see high variability in the response time for a particular query it usually means one of two things:

    1) You’re usually being saved by the query cache, and occasionally you get a miss and have to really run the query. In this case you see a pattern of 0.01, 0.01, 0.01, 13.45, 0.01, 0.01, etc.

    2) The query runs relatively infrequently and as such the data blocks needed to resolve it may or may not be in buffer cache. In this case you see a big skew in response time, but there’s usually a pattern where if you slam the query a bunch of times in a row, the first one is slow, and subsequent ones are fast.
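
    A quick way to tell these two cases apart for the query in question is to check the plan and bypass the query cache explicitly:

    -- Confirm the optimizer is actually using the index on id
    EXPLAIN SELECT id, name FROM users WHERE id IN (10000126213392676,206123495,10320001228994311,100001241345384504,10000176754237888,100002163455062416,1000017033535841276,1);

    -- SQL_NO_CACHE skips the query cache; if repeated runs are still fast,
    -- the speedup is coming from the InnoDB buffer pool rather than the query cache
    SELECT SQL_NO_CACHE id, name FROM users WHERE id IN (10000126213392676,206123495,10320001228994311,100001241345384504,10000176754237888,100002163455062416,1000017033535841276,1);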

  4. Vlad Fridman says:

    Hi Peter,

    I am new to MySQL optimization, but I ran into a problem which seems very similar to what you were benchmarking. My query is almost exactly like yours:

    SELECT id, name FROM users WHERE id IN (10000126213392676,206123495,10320001228994311,100001241345384504,10000176754237888,100002163455062416,1000017033535841276,1);

    where the users table has 300,000 entries and id is the indexed column. I cannot understand why this query sometimes takes 0.5 seconds to run. Is this a MySQL limitation? Does it actually go to the disk for each user record? I run this on a 16GB shared server. Perhaps it is because I am on a shared server and there is some memory limit? Any advice would be greatly appreciated.

    Thanks,
    Vlad

  5. Dimitri says:

    Hi Peter,

    I think you’re hitting “btr_search_latch” mutex contention during this test. Did you also collect InnoDB mutex status while testing? That may shed light on your bottleneck for sure!
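
    (For reference, that information can be pulled directly from the server while the test is running:)

    -- Lists InnoDB mutex/rw-lock wait counters; heavy waits typically attributed to
    -- btr0sea.c point at btr_search_latch / adaptive hash index contention
    SHOW ENGINE INNODB MUTEX;

    -- The SEMAPHORES section here also shows which latches threads are waiting on
    SHOW ENGINE INNODB STATUS\G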

    (BTW, last year I wrote a detailed report about this contention: http://dimitrik.free.fr/db_STRESS_MySQL_55_and_btr_search_latch_InnoDB_Oct2010.html – and as you may see, depending on the kind of queries you may or may not hit btr_search_latch. Usually the sysbench workload does not hit it, and neither does DBT2 (it is just I/O-bound), while TPCC-like workloads, from what I saw before, block on the index mutex. That's why I'm still using dbSTRESS, because even its default workload is still relevant and points to existing internal bottlenecks :-) )

    However, what looks strange in your results is the progression in TPS between 1, 2 and 4 users; the results here should be simply linear, but they are not :-) At 4 users we are still far from hitting contention on InnoDB internals, so I'm very curious to understand what is going wrong here :-) Could you replay these tests again? (And this time with mutex status monitoring?)

    As well, using a 400GB buffer pool was not something common until now, so perhaps you have hit a new issue? Is it possible to replay the same queries but with a reduced “range”? (I mean so that the active data set fits in 40GB, for example; then we will be more sure about the source of the problem.) Vadim mentioned before that he observed some problems with huge buffer pool sizes, but the mutex status details were missing there as well, so it is hard to understand where the problem is.

    Rgds,
    -Dimitri

  6. Gennady says:

    We ran into something similar yesterday. We had a query of the form SELECT .. IN (SELECT ..) which took 16 seconds on Percona Server 5.5, with SphinxSE handling the query in the subquery portion of that SQL. Changing the query to SELECT … INNER JOIN (SELECT ..) t brought it down to 0 seconds.
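
    The general shape of that rewrite (table and column names below are made up for illustration):

    -- Before: IN (SELECT ...) form, executed as a dependent subquery and very slow
    SELECT d.* FROM docs d WHERE d.id IN (SELECT doc_id FROM sphinx_results);

    -- After: materialize the subquery as a derived table and join to it
    SELECT d.* FROM docs d INNER JOIN (SELECT doc_id FROM sphinx_results) t ON t.doc_id = d.id;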

    Gennady

  7. Raine says:

    Woosh :) What a machine! Thanks for the answer.

    Raine

  8. Peter Zaitsev says:

    Raine,

    Yeah. This was a 512GB box. I probably should have gone higher than 400GB.

  9. Peter Zaitsev says:

    Rob,

    Yeah, it would be good to write a blog post on it or something. For now you can think of it as a magical thing which makes key lookups a lot faster, though it can also become a point of contention.

  10. Raine says:

    A 400GB buffer pool size? Is that correct?

  11. Rob Wultsch says:

    Any chance I can request a post about how the adaptive hash index works? There are very few docs on this topic.
