September 29, 2006

What to tune in MySQL Server after installation

Posted by peter

My favorite question during Interview for people to work as MySQL DBAs or be involved with MySQL Performance in some way is to ask them what should be tuned in MySQL Server straight after installation, assuming it was installed with default settings.

I’m surprised how many people fail to provide any reasonable answer to this question, and how many servers are where in wild which are running with default settings.

Even though you can tune quite a lot of variables in MySQL Servers only few of them are really important for most common workload. After you get these settings right other changes will most commonly offer only incremental performance improvements.

key_buffer_size - Very important if you use MyISAM tables. Set up to 30-40% of available memory if you use MyISAM tables exclusively. Right size depends on amount of indexes, data size and workload - remember MyISAM uses OS cache to cache the data so you need to leave memory for it as well, and data can be much larger than indexes in many cases. Check however if all of key_buffer is used over time - it is not rare to see key_buffer being set to 4G while combined size of .MYI files is just 1GB. This would be just a waste. If you use few MyISAM tables you’ll want to keep it lower but still at least 16-32Mb so it is large enough to accommodate indexes for temporary tables which are created on disk.

innodb_buffer_pool_size This is very important variable to tune if you’re using Innodb tables. Innodb tables are much more sensitive to buffer size compared to MyISAM. MyISAM may work kind of OK with default key_buffer_size even with large data set but it will crawl with default innodb_buffer_pool_size. Also Innodb buffer pool caches both data and index pages so you do not need to leave space for OS cache so values up to 70-80% of memory often make sense for Innodb only installations. Same rules as for key_buffer apply - if you have small data set and it is not going to grow dramatically do not oversize innodb_buffer_pool_size you might find better use for memory available.

innodb_additional_mem_pool_size This one does not really affect performance too much, at least on OS with decent memory allocators. Still you might want to have it 20MB (sometimes larger) so you can see how much memory Innodb allocates for misc needs.

innodb_log_file_size Very important for write intensive workloads especially for large data sets. Larger sizes offer better performance but increase recovery times so be careful. I normally use values 64M-512M depending on server size.

innodb_log_buffer_size Default for this one is kind of OK for many workloads with medium write load and shorter transactions. If you have update activity spikes however or work with blobs a lot you might want to increase it. Do not set it too high however as it would be waste of memory - it is flushed every 1 sec anyway so you do not need space for more than 1 sec worth of updates. 8MB-16MB are typically enough. Smaller installations should use smaller values.

innodb_flush_log_at_trx_commit Crying about Innodb being 100 times slower than MyISAM ? You probably forgot to adjust this value. Default value of 1 will mean each update transaction commit (or each statement outside of transaction) will need to flush log to the disk which is rather expensive, especially if you do not have Battery backed up cache. Many applications, especially those moved from MyISAM tables are OK with value 2 which means do not flush log to the disk but only flush it to OS cache. The log is still flushed to the disk each second so you normally would not loose more than 1-2 sec worth of updates. Value 0 is a bit faster but is a bit less secure as you can lose transactions even in case MySQL Server crashes. Value 2 only cause data loss with full OS crash.

table_cache - Opening tables can be expensive. For example MyISAM tables mark MYI header to mark table as currently in use. You do not want this to happen so frequently and it is typically best to size your cache so it is large enough to keep most of your tables open. It uses some OS resources and some memory but for modern hardware it is typically not the problem. 1024 is good value for applications with couple hundreds tables (remember each connection needs its own entry) if you have many connections or many tables increase it larger. I’ve seen values over 100.000 used.

thread_cache Thread creation/destructions can be expensive, which happen at each connect/disconnect. I normally set this value to at least 16. If application has large jumps in amount of concurrent connections and I see fast growth of
Threads_Created variable I boost it higher. The goal is not to have threads created in normal operation.

query_cache_size If your application is read intensive and you do not have application level caches this can be great help. Do not set it too large as it may slow things down as its maintenance may get expensive. Values from 32M to 512M normally make sense. Check it however after a while and see if it is well used. For certain workloads cache hit ratio is lower than would justify having it enabled.

Note: as you can see all of these are global variables. These variables depend on hardware and mix of storage engines, while per session variables are typically workload specific. If you have simple queries there is no reason to increase sort_buffer_size even if you have 64GB of memory to waste. Furthermore doing so may decrease performance.
I normally leave per session variable tuning to second step after I can analyze workload.

P.S Note MySQL distribution contains bunch of sample my.cnf files which may be great templates to use. Typically they would already be much better than defaults if you chose correct one.

September 27, 2006

APC or Memcached

Posted by peter

A while back I did Cache Performance Comparison for LAMP Stack.

Looking at this data you can see memcached is about 5 times as slow as APC, and this is with tests done on localhost - with network difference is going to be larger, even with fastest network. Such latency can add up especially if you’re, being lazy “P” Developer, request objects from cache one by one rather than fetching all items you need for the page at once (not to mention this is not always possible).

So I thought if there is any way to use both of them at once benefiting from strong sides of each of them.
[read more...]

Life Beyond MySQL

Posted by peter

When I was leaving MySQL I wrote I will be doing some Web projects on my own, besides doing MySQL Consulting. I think some people thought I am just using that as an excuse and really I will only do consulting. So I’m pretty happy to announce my first project.

NNSEEK Is NewsGroup search engine, as well as NewsGroup->Blog converter, allowing people to view NewsGroups as blog, meaning threads starts are showed as posts and replies are shown as comments.

It is quite early version so at this point - it has number of little problems, rough edges, it is also not overly optimized in terms of performance yet, but it is already becoming fun to play with.

Much more features to come in the future. We’re planning to localize it (at least in Russian), add some more fun features, add web services API so it can be integrated with other systems etc. If you have some ideas what else would be good to add drop us a note.

We’re also interested in more data so if you have some newsgroups archive (non binary) or can peer with us let me know. We’re especially interested in local groups and gateways which have newsgroups not available from large peers.

At this point we have some 20GB of data indexes and few hundred megs of new data per day. We should have 10 times size of data in a few months.

If you’re interested to know what this size is developed on - it uses Linux/Apache/PHP/MySQL of course plus Sphinx for full text search.

September 22, 2006

Test Drive of Solid

Posted by Vadim

Not so long ago Solid released solidDB for MySQL Beta 3 so I decided now is time to take a bit closer look on new transactional engine for MySQL. While my far goal is the performance and scalability testing before I wanted to look at basic transactional properties such as deadlock detection, select for update handling and phantom reads in the repeatable read isolation level.

[read more...]

EuroOSCON 2006 - High Performance FullText Search

Posted by peter

I’m now back from EuroOSCON 2006 which was the reason I was not posting for a while. Pretty interesting event, even though it looks like it is getting less geeky compared to OSCON in US I visited two years ago - a lot of presentations now shifted to philosophical, political and business issues. I however do not know might be this is just Europe thing.

I gave a talk on High Performance FullText Search for Database Content which is now available for download from MySQL Performance Presentations page

[read more...]

September 12, 2006

Database access Optimization in Web Applications.

Posted by peter

This is pretty simple approach I often use called to optimize web application performance if problem happens with few pages. If we have “everything is slow” problem looking at slow query logs may be better start.

So what could you do ?

Look at the information shown on the page which comes from database. This is important as this is information you need to have retrieved from database one way or another if you want to keep page look the same. Your task is simply to have this information delivered from the database most efficient way.

Investigate how dynamic is this information If information rarely changes or is static it may be better to pre-create or cache it. There are many cache and pre-creation techniques which you can use. Just remember avoiding accessing database is the best way you can optimize database access. This applies to anything else - if you can avoid dynamic page generation at all and have server cache to serve it it is even better.

Check if information retrieved from the database matches information you display Way to often much more information is retrieved from the database than used to generate the page. It can be mild as SELECT * FROM tbl instead of listing columns you actually need or it can be as bad as running SELECT * FROM tbl to count amount of rows in the table (no joking). Sometimes you would see 100 stories selected so one random of them will be displayed and similar things with filtering on application level. Sometimes you can do it more efficient on application level but generally you should try your queries to return you only information you need.

Check number of rows which are used to generate result set This is very important and often forgotten step. Some people think query is simple if it returns few rows while what really matters is amount of rows are analyzed by the query. For example SELECT COUNT(*) FROM links WHERE domain = ‘mysql.com’; will return only one row while may require to scan hundreds of thousands of rows (or index entries) to do so. Other common killer queries are GROUP BY Queries and Sort Queries - SELECT name,descr FROM titles ORDER BY rank DESC LIMIT 10 - If there is no proper index and query has to use “filesort” it may become a bit problem. Other thing to mention is JOINs - Joins are expensive (relatively of course) and for sure increase amount of rows which need to be used to generate result set - if you have to join 10 tables to compose the object it is much slower than getting the same data by single row read.

Check number of rows which actually need to be used to generate result sets Sometimes query needs to use many rows to provide result set just because schema is not optimally indexed - this is easy. For example our ORDER BY rank query is such - adding index on rank column maxes this query to use 10 rows to return 10 rows - exactly what we’re looking for. However our COUNT(*) query is different - even if you have index on domain it still can require a lot of rows scanned to provide result set. Such queries need to be redesigned rather than simply adjusted - for example summary table which holds number of links per domain would do in this case.

Check number of queries If you can retrieve same amount of data with one query it is better than doing so by several queries, assuming this query would not need to analyze much more rows because it is optimized differently. One of typical examples here would be SELECT * FROM tbl WHERE id=5 executed many times with different constants - it is worth to replace such queries with query using IN(5,7,4,56) instead. Do not however be obsessed with this. I’ve seen people trying to join all queries in single UNION (with padding to accommodate different types and number of columns) - this is not a good practice. However if you can reduce number of queries without a lot of complications to application architecture it is good thing to do.

September 9, 2006

Alternatives of PHP ?

Posted by Vadim

When I loaded GigaBytes of XMLs into mysql database with PHP script, I was thinking about PHP alternatives. Why do I need that ?
1. PHP is slow - I’m speaking about area of data processing and implementation of algorithms
2. No good cli debugger - I’m just tired of debugging with ‘echo’ and ‘var_dump’
3. Unpredictable memory consumption - it’s easy in processing of big files to eat all available memory
4. Need something new - I’ve been using PHP for almost 10 years, so I want to try something else to refresh my mind.
[read more...]

September 8, 2006

Why Index could refuse to work ?

Posted by peter

Have you ever seen index which refused to be used even if there is every reason for it to work (from the glance view):

SQL:
  1. mysql> EXPLAIN SELECT * FROM article WHERE article_id=10;
  2. +----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
  3. | id | select_type | TABLE   | type | possible_keys | KEY  | key_len | ref  | rows  | Extra       |
  4. +----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
  5. 1 | SIMPLE      | article | ALL  | PRIMARY       | NULL | NULL    | NULL | 93490 | USING WHERE |
  6. +----+-------------+---------+------+---------------+------+---------+------+-------+-------------+
  7. 1 row IN SET (0.00 sec)

Why on the earth index would not be used you would think, even if MySQL is mentioning it in "possible keys" ? Should you try to force it ?

[read more...]

September 7, 2006

Internals of InnoDB mutexes

Posted by Vadim

InnoDB uses its own mutexes and read-write locks instead of POSIX-mutexes pthread_mutex*, the main reason for that is performance, but InnoDB's implementation isn't ideal and on modern SMP boxes can cause serious performance problems.
Let's look on InnoDB mutex (schematic for simplification):

[read more...]

September 6, 2006

Wrong GROUP BY makes your queries fragile

Posted by peter

This is probably well known issue for everyone having some MySQL experience or experience with any other SQL database. Still I see this problem in many production applications so it is worth to mention it, especially as it is connected to MySQL Performance. No it might not affect MySQL Performance per say but it limits our ability tune MySQL Performance as queries become fragile - changing execution plan leads to different query results.

So what I'm speaking about ?

Lets say you have query something like SELECT A,B,MAX(C) FROM TBL GROUP BY A - what would you expect from such query ? Column A is part of group by so its value is same for whole group. MAX(C) is also particular value for each group, while B is not part of GROUP BY and may well correspond to different values. Which one are you looking to get ? In fact this is where results becomes non-deterministic and fragile - any B from the group could be returned, while you might be expecting some particular one.

Many other DBMS and ANSI SQL Simply forbids such queries, MySQL is more permissive and will return you first B it runs into. This value however can be dependent on selected execution plan and change if you add some indexes. Also plan may change as database statistics changes and your query may become broken without any reason.

My Advice is to stay away from such group by statements, however as usually there are exceptions and such GROUP BY statements may be faster than alternatives. Assume for example we have non-normalized table (or join result) where two columns have 1-1 relationship. For example we have user id and login which are both unique. In this case running

SELECT id,login,max(login_time) FROM log GROUP BY id,login May be replaced by SELECT id,login,max(login_time) FROM log GROUP BY id which may be faster especially if GROUP BY is executed via sorting.

In this case even though we do not know which value from the group will be returned we do not care as it will be same for all rows.