It is a known fact that ext3 is not the most efficient file system out there; for example, file removals can be painfully slow and cause a lot of random I/O. However, as it turns out, it can sometimes have a much more severe impact on MySQL performance than it would seem. When and why?
When you run `DROP TABLE`, several things need to happen: a write lock is taken on the table so it cannot be used by any other thread, the storage engine removes the data file(s), and in the end MySQL destroys the definition file (.frm). That's not all that happens, though – there is one other thing:
```
VOID(pthread_mutex_lock(&LOCK_open));
error= mysql_rm_table_part2(thd, tables, if_exists, drop_temporary, 0, 0);
pthread_mutex_unlock(&LOCK_open);
```
The entire code which removes a table is wrapped in the `LOCK_open` mutex. The mutex is used in various places inside MySQL, but primarily when tables are being opened or closed. It means that while `LOCK_open` is held, no query can be executed, because all of them are blocked from accessing any table.
This is where slow file removal on the ext3 file system starts to be a pain. Deleting a 10GB file can take several seconds, and if that file is a MySQL table, the mutex remains locked for all that time, stalling all queries:
```
+-----+------+-----------+------+---------+------+----------------+------------------------------------------------+
| Id  | User | Host      | db   | Command | Time | State          | Info                                           |
+-----+------+-----------+------+---------+------+----------------+------------------------------------------------+
|   1 | root | localhost | test | Query   |    7 | NULL           | drop table large_table                         |
| 329 | root | localhost | test | Query   |    7 | Opening tables | select sql_no_cache * from other_table limit 1 |
+-----+------+-----------+------+---------+------+----------------+------------------------------------------------+
```
I tried some alternative approaches to trick MySQL into removing smaller files with `DROP TABLE` to minimize the effect, such as:
- TRUNCATE TABLE large_table; ALTER TABLE large_table ENGINE=…; DROP TABLE large_table;
- TRUNCATE TABLE large_table; OPTIMIZE TABLE large_table; DROP TABLE large_table;
Unfortunately, as it turned out, each of the administrative commands like `ALTER TABLE` or `OPTIMIZE TABLE` one way or another holds the `LOCK_open` mutex while the old table files are deleted:
```
| 3   | root | localhost | test | Query | 7 | rename result table | ALTER TABLE large_table ENGINE=MyISAM |
| 679 | root | localhost | test | Query | 6 | Opening tables      | select * from other_table limit 1     |
```
The only alternative seems to be changing the file system – to XFS, for example, which handles file removals much more efficiently:
EXT3

```
mysql> drop table large_table;
Query OK, 0 rows affected (7.44 sec)
```

XFS

```
mysql> drop table large_table;
Query OK, 0 rows affected (0.29 sec)
```
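The gap can be reproduced outside of MySQL with a quick-and-dirty benchmark that writes a large file on the target mount point and times the unlink. This is a sketch, not a rigorous benchmark – the path and file size are placeholders you should adapt:

```python
import os
import time

def time_unlink(path, size_bytes, chunk=1 << 20):
    """Write size_bytes of real data to path, then time how long os.unlink() takes."""
    block = b"\0" * chunk
    with open(path, "wb") as f:
        written = 0
        while written < size_bytes:
            n = min(chunk, size_bytes - written)
            f.write(block[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())           # make sure the blocks are really allocated
    start = time.monotonic()
    os.unlink(path)
    return time.monotonic() - start

# Point the path at the mount you want to test; larger sizes show the gap better.
print("unlink took %.3f s" % time_unlink("/tmp/unlink_test.dat", 64 << 20))
```

Sparse files (created with `truncate` alone) will not show the effect, because ext3's cost comes from freeing the allocated blocks.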
A better solution in MySQL internals could be to simulate the table drop by renaming the corresponding data file or files and physically removing them outside of the mutex lock. However, it may not be that simple, because the actual removal is performed by the storage engine, so it is not something the MySQL core code can fully control.
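The rename-then-remove idea can be sketched in a few lines: a rename is a cheap metadata-only operation, so only it would need to happen under the lock, while the expensive unlink runs in the background. A minimal illustration (paths and the background-thread approach are my assumptions, not MySQL's actual code):

```python
import os
import threading

def drop_file_lazily(path):
    """Emulate a lazy drop: rename the file (fast, metadata-only),
    then let the slow unlink happen in a background thread."""
    doomed = path + ".doomed"
    os.rename(path, doomed)                        # fast, even for huge files
    t = threading.Thread(target=os.unlink, args=(doomed,))
    t.start()
    return t                                       # caller can join() later if needed
```

In MySQL terms, only the rename would have to be performed while `LOCK_open` is held; the block freeing would no longer stall other queries.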
This is certainly not a common situation, but may become a problem to anyone when it’s the least expected (e.g. dropping old unused tables).
And what about ext4? For some months now all my new MySQL servers have used only ext4 instead of xfs; is that a wrong choice?
Oliver,
I have seen ext4 with extents enabled show a significant improvement over ext3 in large file removals, but still far behind xfs. It was a while ago though, so things could have changed since then.
Maciek
Surely the correct solution is, as you say, to change the metadata (quickly) and let the file system operations happen asynchronously. Changing filesystems to solve this sounds pretty backasswards. (That said, how does reiser3 do here?)
A solution from a completely different field.
The developers of mythtv (a TV recording system) achieved better performing deletes of large files on ext3 through the “slow delete” feature. Typically TV recordings are 1-12GB per hour, so a movie could be 20+GB in size. Without the feature, a delete of the file on ext3 would lock up the entire system for about 20-30 seconds.
The “slow delete” simply truncates the files in chunks in the background.
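The MythTV approach can be sketched in a few lines: repeatedly shrink the file from the end and sync between steps, so the filesystem frees blocks in small batches instead of all at once. Chunk size and pacing below are illustrative, not MythTV's actual values:

```python
import os
import time

def slow_delete(path, chunk=128 << 20, pause=0.1):
    """Truncate the file down in chunks, then remove the (now empty) file."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        while size > 0:
            size = max(0, size - chunk)
            f.truncate(size)
            f.flush()
            os.fsync(f.fileno())       # commit the freed blocks before continuing
            time.sleep(pause)          # pace the work, leave room for other I/O
    os.unlink(path)                    # file is empty now, removal is cheap
```

Each truncate step only frees one chunk's worth of block metadata, so no single filesystem operation stalls the machine.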
Toby,
I haven’t used or even seen reiserfs in ages, but if I remember correctly it is not good with larger files.
Maciek
Richard,
I think that’s probably another step which could be considered to reduce the load spikes, however the primary problem is doing it inside a global mutex lock.
Maciek
Does this also apply to temp tables? Seems like this could be a huge problem and would possibly cause a large sort to block other arbitrary select queries?
Generally, temp tables are going to show up in /tmp, which is often tmpfs. I wouldn't expect to see slow delete mechanics at work in that case since you're basically memory mapped, although I'll admit to not having run a test :).
For me there are three cases where tables are created in /tmp:
– when you use "create temporary table"
– when MySQL needs a temporary table which uses a BLOB or TEXT column
– when MySQL needs a temporary table which is bigger than tmp_table_size/max_heap_table_size
Right?
And because of this last case, we can't use /dev/shm to store temporary tables: it may use too much memory.
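The thresholds mentioned above live in my.cnf; the values here are illustrative only:

```ini
# my.cnf fragment (illustrative values)
[mysqld]
tmp_table_size      = 100M   # in-memory temp tables spill to disk above this
max_heap_table_size = 100M   # must also be raised; the lower of the two wins
# tmpdir = /dev/shm          # risky: one big implicit temp table can exhaust RAM
```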
Maybe I'm wrong here (it's been a while since I looked seriously at filesystems), but isn't tmpfs always implemented as virtual memory on Linux?
http://en.wikipedia.org/wiki/TMPFS
This issue has been a big problem for us for a long time as we constantly create and drop schemas. I worked around the issue by having a special method that drops tables one at a time and sleeps if the process count gets too high on the master or slaves. Though the solution is far from ideal as large tables still lock things up for too long.
See the mysql Worklog task for this at: forge.mysql.com/worklog/task.php?id=3983
On our production redhat machines at least it seems to be ext3, not sure if that’s default or a quirk of our particular setup though.
/dev/vg00/lvtmp /tmp ext3 defaults 1 2
And indeed tmpfs is virtual memory (RAM+SWAP) backed (http://en.wikipedia.org/wiki/TMPFS). However, for a dedicated MySQL machine that doesn't make too much sense, since if you had the RAM you'd just increase your tmp_table_size etc. to use that RAM directly, I imagine. I don't mind that these large sorts use disk, but I do mind if they block my other select queries for no good reason. As a general rule, locks should absolutely never be held around any I/O operation…
Patrick E,
I haven't examined the sources for temp tables, so I do not know if they suffer from a similar problem. I suppose that since they do not require such locking (there is no possibility of any concurrent access to them), MySQL may do things differently. However, obviously, if there is a large temporary table located on an ext3 partition, it will cause a lot of random I/O on delete, thus affecting system performance.
Maciek
pat,
Actually, /tmp is typically created on the system storage (i.e. local disk) using whatever file system was chosen for the installation. To speed up occasional but rather smallish temporary tables, some people move it to tmpfs (which I think is unreasonable design, because you would rather want MySQL's tmpdir on tmpfs than the system-wide /tmp).
Maciek
To explain our rationale for not using tmpfs, and using a disk store like ext3:
1) Our database is typically 100% cached (both data and indexes), 16 – 20 GB in a few hundred MyISAM tables (out of 24 GB RAM on the machine). It is 99.9999% reads on a very static but highly random set of data.
2) Most sorts are small or medium sized, we have max_heap_table_size=100M and tmp_table_size=100M in order to do these in memory
3) We have occasional large (2-4 GB +) sorts that we'd like to spill to disk. It's okay if these are somewhat slow. I/O contention is not an issue since all reads are cached and the few writes we have are not performance critical.
I’m just wondering if in this case this lock will bite us for these large sorts (which afaics, we definitely do need on disk, since our ram is mostly dedicated to caching the db / indexes). I’d go spelunking, but have no experience with the mysql codebase really. Just a concerned user here 🙂
Maciek,
I should note this applies not only to MyISAM tables but to InnoDB tables in innodb_file_per_table mode as well.
I recently spoke to a customer who simply can't drop a 400G table because it stalls everything for a long time.
See mysql work log on this: http://forge.mysql.com/worklog/task.php?id=3983
Maciej,
are you testing this with ext3's data=journal, data=ordered, or data=writeback mode? The default is data=ordered, and the closest behavior to XFS can be achieved via the writeback mode (which, btw, will become the default soon).
Note that ext4 as it is included in RHEL5.3 is clearly marked as a technology preview, not production quality. But since they refused to support XFS in favor of the outdated ext3 for years now, the best bet for heavy I/O machines is Novell SLES anyways 🙂
Michael
I've found that doing a TRUNCATE on a MyISAM table will resize the .MYD file without a lock. You can then drop the table instantly. With InnoDB tables, the .ibd file doesn't get resized/deleted until you OPTIMIZE or DROP.
If you’re using percona xtradb binaries you can use this option to reduce the pain.
https://www.percona.com/doc/percona-server/5.1/performance/innodb_lazy_drop_table.html
‘njoy,
Mark
You can create a hard link to the table's data file, and then drop the table.
The drop will only decrement the file's link count, so it doesn't lock the table for a long time.
Afterwards, you can delete the hard link, which doesn't lock the table at all.
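The hard-link trick above can be sketched like this (the table path is hypothetical, and you must make sure MySQL no longer references the file when you remove the spare link):

```python
import os

def prepare_lazy_drop(table_file):
    """Add a second directory entry (hard link) for the table's data file.
    DROP TABLE will then only decrement the link count instead of
    freeing all the blocks, so it returns quickly."""
    spare = table_file + ".todelete"
    os.link(table_file, spare)         # cheap metadata-only operation
    return spare

# Workflow (paths are hypothetical):
#   spare = prepare_lazy_drop("/var/lib/mysql/test/large_table.ibd")
#   ... run DROP TABLE large_table in MySQL (now fast) ...
#   os.unlink(spare)   # the slow block freeing happens here, outside LOCK_open
```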
This is still a problem even after 10 years. I recently experienced it when dropping a database hosted along with 10 other databases on the same node: almost everything stalled for 3 minutes, with connections rapidly increasing from 80 to 500. I am using MySQL 5.5.19.