InnoDB: look after fragmentation

One problem made me puzzled for couple hours, but it was really interesting to figure out what’s going on.

So let me introduce problem at first. The table is

CREATE TABLE `c` (
  `tracker_id` int(10) unsigned NOT NULL,
  `username` char(20) character set latin1 collate latin1_bin NOT NULL,
  `time_id` date NOT NULL,
  `block_id` int(10) unsigned default NULL,
  PRIMARY KEY  (`tracker_id`,`username`,`time_id`),
  KEY `block_id` (`block_id`)
) ENGINE=InnoDB

CREATE TABLE `c` (

`tracker_id` int(10) unsigned NOT NULL,

`username` char(20) character set latin1 collate latin1_bin NOT NULL,

`time_id` date NOT NULL,

`block_id` int(10) unsigned default NULL,

PRIMARY KEY (`tracker_id`,`username`,`time_id`),

KEY `block_id` (`block_id`)

) ENGINE=InnoDB

Table has 11864696 rows and takes Data_length: 698,351,616 bytes on disk

The problem is that after restoring table from mysqldump, the query that scans data by primary key was slow. How slow ? Let me show.

The query in question is (Q1):

SELECT count(distinct username) FROM  tracker where TIME_ID >= '2009-07-20 00:00:00' AND TIME_ID <= '2009-10-21 00:00:00' AND (tracker_id=437)

1	SELECT count(distinct username) FROM tracker where TIME_ID >= '2009-07-20 00:00:00' AND TIME_ID <= '2009-10-21 00:00:00' AND (tracker_id=437)

On cold buffer_pool, it took:

+---------------------------+
| count(distinct username) |
+---------------------------+
|                   5856156 | 
+---------------------------+
1 row in set (4 min 13.61 sec)

+---------------------------+

| count(distinct username) |

+---------------------------+

| 5856156 |

+---------------------------+

1 row in set (4 min 13.61 sec)

However the query (again on cold buffer_pool) (Q2)

SELECT count(distinct username) FROM  tracker where TIME_ID >= '2009-07-20 00:00:00' AND TIME_ID <= '2009-10-21 00:00:00'

1	SELECT count(distinct username) FROM tracker where TIME_ID >= '2009-07-20 00:00:00' AND TIME_ID <= '2009-10-21 00:00:00'

+---------------------------+
| count(distinct username) |
+---------------------------+
|                   5903053 | 
+---------------------------+
1 row in set (18.81 sec)

+---------------------------+

| count(distinct username) |

+---------------------------+

| 5903053 |

+---------------------------+

1 row in set (18.81 sec)

Difference is impressive. 4 min 13.61 sec vs 18.81 sec

If you want EXPLAIN plain, here it is:

For Q1:

+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+
| id | select_type | table                   | type | possible_keys | key     | key_len | ref   | rows    | Extra                    |
+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+
|  1 | SIMPLE      | tracker  | ref  | PRIMARY       | PRIMARY | 4       | const | 6880241 | Using where; Using index | 
+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+
1 row in set (0.02 sec)

+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+

+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+

+----+-------------+-------------------------+------+---------------+---------+---------+-------+---------+--------------------------+

1 row in set (0.02 sec)

For Q2:

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+
| id | select_type | table                   | type  | possible_keys | key                                 | key_len | ref  | rows     | Extra                    |
+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | tracker | index | NULL          | block_id | 5       | NULL | 13760483 | Using where; Using index | 
+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+--------------------------+

Query Q1 is executed using Primary Key, and Query Q2 is using block_id key.

To get more details I ran both queries with our extended stats in slow.log (available in 5.0-percona releases)

So for query Q1:

# Query_time: 253.643162  Lock_time: 0.000137  Rows_sent: 1  Rows_examined: 11569733  Rows_affected: 0  Rows_read: 11569733
#   InnoDB_IO_r_ops: 73916  InnoDB_IO_r_bytes: 1211039744  InnoDB_IO_r_wait: 236.149003
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 54838

# Query_time: 253.643162 Lock_time: 0.000137 Rows_sent: 1 Rows_examined: 11569733 Rows_affected: 0 Rows_read: 11569733

# InnoDB_IO_r_ops: 73916 InnoDB_IO_r_bytes: 1211039744 InnoDB_IO_r_wait: 236.149003

# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000

# InnoDB_pages_distinct: 54838

And for query Q2:

# Query_time: 18.846855  Lock_time: 0.000123  Rows_sent: 1  Rows_examined: 11864696  Rows_affected: 0  Rows_read: 11864696
#   InnoDB_IO_r_ops: 27510  InnoDB_IO_r_bytes: 450723840  InnoDB_IO_r_wait: 0.165124
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 24687

# Query_time: 18.846855 Lock_time: 0.000123 Rows_sent: 1 Rows_examined: 11864696 Rows_affected: 0 Rows_read: 11864696

# InnoDB_IO_r_ops: 27510 InnoDB_IO_r_bytes: 450723840 InnoDB_IO_r_wait: 0.165124

# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000

# InnoDB_pages_distinct: 24687

As you see for Q1 IO read took 236.149003 sec vs 0.165124 for Q2. But Q1 is scan by primary key, which supposed to be
sequential!

Let’s see on another statistic, which available in innodb_check_fragmentation patch:

for Q1:

SHOW STATUS LIKE 'Innodb_scan_pages%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_scan_pages_contiguous | 88    | 
| Innodb_scan_pages_jumpy      | 73789 | 
+------------------------------+-------+
2 rows in set (0.00 sec)

SHOW STATUS LIKE 'Innodb_scan_pages%';

+------------------------------+-------+

| Variable_name | Value |

+------------------------------+-------+

| Innodb_scan_pages_contiguous | 88 |

| Innodb_scan_pages_jumpy | 73789 |

+------------------------------+-------+

2 rows in set (0.00 sec)

and for Q2:

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';        
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_scan_pages_contiguous | 26959 | 
| Innodb_scan_pages_jumpy      | 442   | 
+------------------------------+-------+
2 rows in set (0.00 sec)

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';

+------------------------------+-------+

| Variable_name | Value |

+------------------------------+-------+

| Innodb_scan_pages_contiguous | 26959 |

| Innodb_scan_pages_jumpy | 442 |

+------------------------------+-------+

2 rows in set (0.00 sec)

So you see for Q1 it was not sequential scan, even it is primary key, but it is sequential for Q2.

So what’s the answer ? It’s fragmentation of primary key (and whole data table, as InnoDB data == primary key). But how it could happen with
primary key after mysqldump ? The answer here if we look on

EXPLAIN SELECT * FROM tracker;

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+
| id | select_type | table                   | type  | possible_keys | key                                 | key_len | ref  | rows     | Extra       |
+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+
|  1 | SIMPLE      | tracker | index | NULL          | block_id | 5       | NULL | 13760483 | Using index | 
+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+
1 row in set (0.00 sec)

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+

+----+-------------+-------------------------+-------+---------------+-------------------------------------+---------+------+----------+-------------+

1 row in set (0.00 sec)

We see that dump is taken in key “block_id” order, not in primary key order. And later when we load this table, INSERTS into primary key happens in random order, and that gives us the fragmentation we see here.

How to fix it in our case. It’s easy:

ALTER TABLE tracker ENGINE=InnoDB

1	ALTER TABLE tracker ENGINE=InnoDB

, it will force InnoDB to rebuild table in primary key order.

After that Q1:

+---------------------------+
| count(distinct username) |
+---------------------------+
|                   5856156 | 
+---------------------------+
1 row in set (17.72 sec)

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_scan_pages_contiguous | 37864 | 
| Innodb_scan_pages_jumpy      | 574   | 
+------------------------------+-------+
2 rows in set (0.00 sec)

and extended stats:
# Query_time: 17.765369  Lock_time: 0.000137  Rows_sent: 1  Rows_examined: 11569733  Rows_affected: 0  Rows_read: 11569733
#   InnoDB_IO_r_ops: 38530  InnoDB_IO_r_bytes: 631275520  InnoDB_IO_r_wait: 0.204893
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 35584

+---------------------------+

| count(distinct username) |

+---------------------------+

| 5856156 |

+---------------------------+

1 row in set (17.72 sec)

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';

+------------------------------+-------+

| Variable_name | Value |

+------------------------------+-------+

| Innodb_scan_pages_contiguous | 37864 |

| Innodb_scan_pages_jumpy | 574 |

+------------------------------+-------+

2 rows in set (0.00 sec)

and extended stats:

# Query_time: 17.765369 Lock_time: 0.000137 Rows_sent: 1 Rows_examined: 11569733 Rows_affected: 0 Rows_read: 11569733

# InnoDB_IO_r_ops: 38530 InnoDB_IO_r_bytes: 631275520 InnoDB_IO_r_wait: 0.204893

# InnoDB_rec_lock_wait: 0.000000 InnoDB_queue_wait: 0.000000

# InnoDB_pages_distinct: 35584

You see that time returned to appropriate 17.72 sec.

You may ask what happens now with Q2 ? yes, it’s getting slow now, as we made key “block_id” inserted not in order.

+---------------------------+
| count(distinct username) |
+---------------------------+
|                   5903053 |
+---------------------------+
1 row in set (2 min 8.92 sec)
mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| Innodb_scan_pages_contiguous | 45    | 
| Innodb_scan_pages_jumpy      | 35904 | 
+------------------------------+-------+
2 rows in set (0.00 sec)

+---------------------------+

| count(distinct username) |

+---------------------------+

| 5903053 |

+---------------------------+

1 row in set (2 min 8.92 sec)

mysql> SHOW STATUS LIKE 'Innodb_scan_pages%';

+------------------------------+-------+

| Variable_name | Value |

+------------------------------+-------+

| Innodb_scan_pages_contiguous | 45 |

| Innodb_scan_pages_jumpy | 35904 |

+------------------------------+-------+

2 rows in set (0.00 sec)

As for mysqldump you may use

--order-by-primary

1	--order-by-primary

options to force dump in primary key order.

So notes to highlight:

InnoDB fragmentation may hurt your query significantly, especially when data is not in buffer_pool and execution goes to read from disk
Fragmentation by secondary key is much more likely than by primary key, and you cannot really control it (tough it is possible in XtraDB / InnoDB-plugin with FAST INDEX creation) so be careful with queries scan many records by secondary key
To check if you query affected by fragmentation you can use
Innodb_scan_pages_contiguous ; Innodb_scan_pages_jumpy
1
Innodb_scan_pages_contiguous ; Innodb_scan_pages_jumpy
statistics in 5.0-percona releases (coming to 5.1-XtraDB soon)

13 Comments

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Peter Zaitsev

Admin

14 years ago

Vadim,

I would mention it looks like it is caused by the bug to start with.

SELECT * FROM table

should not be using index but rather primary key. The data is clustered by primary key so it is generally faster.
Also SELECT * FROM table can be only “using index” if there is a key which (together with primary key) covers all columns which means it is not going to be shorter than clustered (primary key) and so even in case they are both fragmented it should not be slower.

I suspect the problem here may come with costs – because Innodb stats are inexact it may look like there are a lot less records to scan using one index than other, while of course if you’re doing full table scan or full index scan number of records is going to be same.

Great fine BTW and nice to see our stats are working 🙂

Arjen Lentz

14 years ago

Interesting case.
Technically it was an equal choice in terms of index selection, since the indexed column (block_id) plus the composite primary key covers all the columns. So it was just as valid to pick that index as it was to just scan the primary key. And it would be just as fast.

Of course, when using the output from that query to insert into another table (via a dump or just INSERT … SELECT), and the destination table is InnoDB, then it’s more beneficial to have the select in primary key order. But hey, we all know that unless you specify an ORDER BY, result set order is undefined.
I don’t think this is a optimiser bug, really. If you want the select ordered, use an ORDER BY clause.

And indeed, mysqldump offers the –order-by-primary option, and using it with InnoDB is good. For other engines, the situation can be quite different. For MyISAM, ordering by primary may in fact trigger a filesort (even go to disk) since a full table scan is likely not in primary key order. It’s arbitrary (well, based on insert order + gap filling).

Peter Zaitsev

Admin

14 years ago

Arjen,

Right. From the size prospective the indexes are about the same. Primary key is even a bit larger as it has more information in it than primary key. However it is still better to scan in primary key order in most cases because it is less fragmented. It is very frequent to see auto increment primary key or other sequential insert patterns.

Miguel DeAvila

14 years ago

Can you say more about the nature of the fragmentation?

Even if the dump occurred in non-pk order, the pk order should have been restored during the import, no?
I thought that the price for the out-of-order restore would be excessive i/o during the restore as the
b-tree is continually re-organized to cope with the out-of-order inserts. After the restore the table must
be in pk order, no?

Steven Roussey

14 years ago

One thing that was great about mysisam was ALTER TABLE ORDER BY, so you could decide how to order your data…

Arjen Lentz

14 years ago

Migael, in InnoDB’s architecture, it works out faster to insert in primary key order. Yes the B+Tree gets rebalanced either way and a B+tree is by definition sorted, but there’s just less work in this particular scenario. That’s why it’s desirable that a dump for InnoDB tables has the rows in PK order.

Peter Zaitsev

Admin

14 years ago

Miguel,

Indeed Innodb clusters data by primary key. This clustering is however per page. For example in case we would use MyISAM we could get a “disk seek” and IO for each row in worse case scenario – in Innodb it does not happen. The data is always going to be clustered on per page basics and hence at least 16K worth of data would be read each time. The number of rows of course depends on row length and page fill factor – in this case we had few hundred rows per page.

Now note in the worse case scenario no read-aheads will trigger and all IO will be done by single thread in 16K blocks. Considering 200 IOS/sec for legacy (non Flash) hard drive you will be looking at about 3MB/sec read speed which is 30-50 times slower than sequential read speed of the same drive.

Frank

14 years ago

Vadim
in this case / with that special purpose in mind why do you compare the runtimes of different queries?
Q1=
SELECT count(distinct username) FROM tracker where TIME_ID >= ‘2009-07-20 00:00:00’ AND TIME_ID = ‘2009-07-20 00:00:00’ AND TIME_ID <= '2009-10-21 00:00:00'

At first glance one would use identical queries and FORCE INDEX usage, wouldn't one?

Vadim

14 years ago

Frank,

I compared different queries as they use different indexes.

Q1 uses primary key and Q2 uses block_id

The idea behind that is if you run query that you think should be executed by PRIMARY KEY you will not
use FORCE KEY (secondary_key) on it. By PRIMARY KEY is usually much faster for InnoDB.

zanzibar

14 years ago

I just ran ‘ALTER TABLE tracker ENGINE=InnoDB’; it takes forever with 20 million rows. I wonder if there’s a better way to defrag on production! 🙂

Petrik_CZ

14 years ago

After several trials I am doing “defragmentation” of innodb 10M+ rows table by exporting to csv file and then loading to new table. Then I rename them, copy all changes which occured during export/import and delete old table. quite fast compared to other methods.

Sheeri

14 years ago

What happened to these variables in later versions? In version Server version: 5.1.55-rel12.6-log Percona Server (GPL), 12.6, Revision 200 the “innodb_scan%” status variables do not exist, but in Server version: 5.0.92-50-log Percona SQL Server, Revision 85 (GPL) they do exist.

Furthermore, the patch that contains these is documented https://www.percona.com/docs/wiki/patches:innodb_check_fragmentation but there’s no indication they are deprecated…..or what replaces them….and sadly, https://www.percona.com/docs/wiki/percona-server:features:indexes:variable_reference does not reference those variables or the patch itself.

What happened? Is there another place I can find this information?

larry liu

4 years ago

Does innodb rebalance in background or real time before write operation finish?

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

InnoDB: look after fragmentation

Related

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

InnoDB: look after fragmentation

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

Valkey/Redis: Configuration Best Practices

Valkey/Redis: The Hash Datatype

Valkey/Redis Replication and Auto-Failover With Sentinel Service

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation