Have you ever had a question about InnoDB internal design or behavior that is not well covered in the MySQL manual? I certainly have.
Now you have a great chance to get them answered!

Heikki Tuuri, the creator of InnoDB, will answer your questions about InnoDB at the MySQL Performance Blog.

Please leave your questions as comments on this post by the 5th of October, and I will merge them with questions of my own and pass them on to Heikki to answer.

Note: due to Oracle policies, Heikki will likely be unable to answer questions about new InnoDB features or release timetables.

Alexander Rubin

How InnoDB handles BLOB/TEXT fields (needs more info)

According to the documentation, InnoDB stores the first 768 bytes of each TEXT/BLOB column in the page
and allocates the remainder outside of the page.

However:
1. If the total size of the row is less than 8100 bytes, InnoDB will not allocate additional
space for the BLOBs, even if each BLOB is larger than 768 bytes.
2. Otherwise, InnoDB will allocate additional space outside of the page for _each_ BLOB (if we have 8
BLOBs, 8xN bytes will be allocated).

Question:
How much space does InnoDB allocate for each BLOB outside of the page?

Test case in: http://bugs.mysql.com/bug.php?id=29042
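A minimal sketch of the kind of table this concerns (hypothetical names; the bug report above has the full test case):

CREATE TABLE blob_test (
  id INT NOT NULL PRIMARY KEY,
  b1 BLOB, b2 BLOB, b3 BLOB, b4 BLOB,
  b5 BLOB, b6 BLOB, b7 BLOB, b8 BLOB
) ENGINE=InnoDB;

-- Every value exceeds the 768-byte in-page prefix, so the question is how
-- much external (off-page) space each BLOB is given:
INSERT INTO blob_test VALUES (1,
  REPEAT('x', 1500), REPEAT('x', 1500), REPEAT('x', 1500), REPEAT('x', 1500),
  REPEAT('x', 1500), REPEAT('x', 1500), REPEAT('x', 1500), REPEAT('x', 1500));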

mike

How does MySQL Cluster help with big tables and lots of write activity?
In particular, how does it help with foreground log I/O writes? How can
MySQL Cluster spread out log I/O writes, etc.?

mike

Is there any advantage to using prepared statements in Java for DML in the following scenario:
get a connection, prepare a statement, substitute the variable values, execute, close the statement,
then repeat this process over and over?

Don't you need to keep the statement open, so that you don't have to re-prepare it,
in order to gain any performance advantage?
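To illustrate the cycle in server-side SQL terms (a sketch; table t and its columns are hypothetical, and JDBC goes through the same prepare/execute/deallocate steps when server-side prepared statements are enabled):

PREPARE ins FROM 'INSERT INTO t (id, val) VALUES (?, ?)';
SET @id = 1, @val = 'abc';
EXECUTE ins USING @id, @val;
-- Closing (deallocating) after every execute forces a re-prepare next time:
DEALLOCATE PREPARE ins;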

Thanks

mike

Sorry. I’ll post those elsewhere.

InnoDB Q: Is group commit still busted in 5.0/5.1?

Kevin Burton

Mike…. RE #3

You should use a connection pool to avoid opening and closing lots of connections.

mike

Hi Kevin,

We are using a connection pool. My belief is that unless you keep the statement open,
prepared statements don't buy you anything. The issue is that opening and closing
the statement defeats the purpose of a prepared statement.

Anyway, I moved the question to the MySQL forum, so we don't clog up the intended
use of these comments.

Kevin Burton

InnoDB uses its own internal buffer pool, but it competes with the Linux kernel's page cache.

In high-load situations where InnoDB is using most of the system's memory, the kernel can decide (incorrectly) to swap out MySQL.

There are two solutions here: memlock and O_DIRECT.

Memlock is apparently unstable on Linux.

O_DIRECT is an alternative, but Linus hates it.

What are your suggestions here?
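For readers at home, the two options look like this in my.cnf (a sketch; placement and values are illustrative, not a recommendation):

[mysqld]
# Lock mysqld's memory so the kernel cannot swap it out:
memlock
# Or bypass the OS page cache for InnoDB data files instead:
innodb_flush_method = O_DIRECT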

Kevin Burton

Just an FYI for readers at home: the --memlock command-line option calls the mlockall() system call to keep the process's memory locked in RAM.

O_DIRECT bypasses the kernel's page cache to do I/O directly. I'm not sure whether it also helps keep the process locked in memory.

Kevin Burton

We've migrated to using InnoDB in such a way that it operates 100% out of memory. This way
every write is sequential, so we can get the full sequential write speed of the disk for
throughput.

InnoDB uses fuzzy checkpointing to commit data to disk. The problem we've
seen in high-I/O scenarios is that it checkpoints too often, and we're only seeing
33% of the raw disk write speed. Since the checkpoint writes aren't perfectly
sequential, InnoDB has to seek on disk, which slows down throughput.

If we could control the checkpointing rate, we could use an 8 GB write-ahead log
and tell InnoDB to do a full write of the database (basically dump memory to
disk) once every minute or two.

This way we'd see a 3x performance boost, writing at 100 MB/s vs.
33 MB/s.

Any plans to enable tuning of the checkpointing rate? Postgres exposes this
and allows the user to tune the checkpointing values.
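The closest existing knobs I'm aware of only control this indirectly (a sketch; since InnoDB must checkpoint before the redo log wraps, larger logs generally mean less frequent flushing; values are illustrative):

[mysqld]
# Total redo log size bounds how long checkpointing can be deferred:
innodb_log_file_size = 1024M
innodb_log_files_in_group = 2
# Upper bound on the fraction of dirty pages kept in the buffer pool:
innodb_max_dirty_pages_pct = 90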

Kevin Burton

What's the status of InnoDB in 5.1.x?

Specifically:

What's the current status of innodb_thread_concurrency? Is it suggested to set
this value to a LARGE value (something like 200) for additional throughput?

Brian Aker did some interesting work on this:

http://krow.livejournal.com/542306.html

And it looks like 5.1.22 has the necessary patches.

Also, what's the deal with group commit being broken in 5.1.x? Is it possible
to get the same benefit by performing INSERTs with multiple value lists? We
build up INSERTs on our clients and insert 50 or so rows at a time.

Rumor has it that if you disable the new XA support (which we don't use),
you get group commit back.
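For reference, the two workarounds mentioned look like this (a sketch; the events table is made up, while innodb_support_xa is the actual server variable):

-- Multi-row INSERT: many rows share one commit.
INSERT INTO events (id, payload) VALUES (1, 'a'), (2, 'b'), (3, 'c');

-- The rumored fix: disable XA support (only safe if you can live without
-- guaranteed binlog/InnoDB consistency after a crash):
SET GLOBAL innodb_support_xa = 0;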

Kevin Burton

InnoDB has typically had problems scaling on multi-core boxes. The new quad-core CPUs from AMD and Intel mean that 8-core boxes are going to be commonplace.

We’re considering buying 40 8-core boxes with 32G of memory.

Any specific thoughts here? InnoDB was originally written on single-core CPUs.

The other day I blogged about how this could be considered another distributed computing fallacy:

http://feedblog.org/2007/09/23/distributed-computing-fallacy-9/

Roland Bouman

In InnoDB, the referenced columns of a unique constraint need not form a primary key or a unique constraint: it is sufficient if the referenced columns form a consecutive set of leftmost columns of any index.

Can you describe a real-world use case where it would be useful to have a foreign key that does not reference an entire primary key or unique constraint (something that is mandatory in every other RDBMS that supports foreign key constraints)? Any concrete examples are appreciated.
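For concreteness, here is a sketch of what InnoDB accepts (hypothetical tables):

CREATE TABLE parent (
  a INT NOT NULL,
  b INT NOT NULL,
  KEY idx_ab (a, b)  -- a non-unique index
) ENGINE=InnoDB;

CREATE TABLE child (
  a INT NOT NULL,
  KEY (a),
  FOREIGN KEY (a) REFERENCES parent (a)  -- references a leftmost prefix of
                                         -- idx_ab, not a PK or UNIQUE key
) ENGINE=InnoDB;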

Thanks in advance,

Roland.

Roland Bouman

gah…I meant: “the columns referenced by a *foreign key constraint* need not form a primary key or a unique constraint”. Hope the question is clear now.

mfc

If my workload has a high amount of concurrent writes to about 6 tables,
what version of InnoDB/MySQL would you recommend? I've been staying on
4.1.14 because of all the bad things I've heard about loss of performance
with 5.0.x. I'm hopeful that 5.1.x will be better, but I'm still not sure.

Follow up question:

My impression (from the point of view of an external InnoDB user) is that performance bugs
don't receive very high priority. Yet a lot of us are using MySQL because it seems to offer
the necessary performance. Are we just kidding ourselves, and should we switch to Oracle?

mike

Would doubling the amount of RAM from 8 GB to 16 GB help the write performance of InnoDB
on Solaris?

Thanks

mike

A question about using solid-state disks for storing the ib_log* files.

I know that InnoDB writes to the files in the log file group in
a round-robin fashion. I also know that some writes always go
to the first file in the log file group. My questions are:

1) What is always being written to the first file in the log file group?
2) How often, or in what pattern, do these first-log-file writes occur?
3) Do they always land at the same location(s) on the device?
4) Do they occur for every log entry?

I'm just trying to see if there will be some special interaction between these
recurring writes to the first file in the log file group and the
SSD's dynamic wear-leveling algorithm. I'm also wondering whether you would recommend
using an SSD for this purpose, and how much performance gain there might be.

Sigurd Urdahl

Hi Heikki,

I believe this is a somewhat trivial question, but it's hard to find a good answer to it by googling.

"What is a good way to split memory between the InnoDB buffers and the OS?"

Of course this depends on all kinds of things, but the only rule of thumb I can find is "InnoDB should get around 70-80% of the available RAM". I would guess that this number varies with how much RAM is available: on a dedicated MySQL server you still need some memory for the OS and the various things that will be running (monitoring, backup, cron, syslog, etc.), but the share of RAM used by these non-MySQL processes drops rapidly as you increase the amount of system memory.

We just deployed a new Linux/MySQL server for a client, upgrading their main database (from 16 GB and two two-year-old hyperthreaded Xeons to 32 GB and four quad-core AMD Opterons; yes, it helped :-)). At the moment the server has around 21 GB of memory used by applications (of which MySQL is the main one, with "innodb_buffer_pool_size=20480M"), while the bulk of the rest (9 GB) is used by the OS cache (yes, I know, we are below 70-80% already).

So, is the general recommendation, even for boxes with larger amounts of RAM, to give InnoDB ~70-80% of the total RAM and leave the rest to the OS cache? Are there special options or tricks that should be used on larger servers?

kind regards,
-sig

Kevin Burton

Sigurd,

Maybe this will make Heikki’s job easier 🙂

Don't give anything to the OS cache. It's redundant with the InnoDB buffer pool.

Instead, just give as MUCH memory as possible to InnoDB. 100% if possible.

Kevin

Kevin Burton

Here’s another question.

Wouldn't it be a good feature to be able to open the write-ahead logs with O_DIRECT? I have a feeling they put pressure on the VM by letting pages enter the cache, thus pushing some of the buffer pool into the page file.

Opening the log files with O_DIRECT would fix this problem.

Kevin

Sigurd Urdahl

(in response to comment #19)

Kevin,

thanks for the answer, but it was maybe a bit more to the point than I was hoping for:-)

Are you saying that the 70-80% rule is wrong, or that the general rule should be "keep enough free memory for anything else to run comfortably, and then give the rest to InnoDB"?

kind regards,
-sig

Peter Bach

There have been several MySQL bugs opened about multi-core scalability (concurrent queries, auto-increment, concurrent inserts, instrumentation, etc.). Rather than treating each defect as an individual case, is there any performance testing going on to quantify the limits of different operations at different levels of concurrency, specifically around mutexes and contention? How does this fit into the development process?

Peter Bach

We have a fairly large InnoDB table (150 GB) that is showing poor performance on full table scans. Using O_DIRECT, we are seeing individual 16 KB page I/Os going out one at a time that are essentially read misses (10 ms each). The reads are not exactly sequential; there is a small gap between each.

It is difficult to figure out what the problem is. We can OPTIMIZE the table, but how long will that take? Is there a facility we can run to show table/index fragmentation, without having to run OPTIMIZE?
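What I have found so far is only approximate (a sketch; big_table stands in for our table):

-- Rough hints only: Data_free and an approximate row count.
SHOW TABLE STATUS LIKE 'big_table';
-- For InnoDB, OPTIMIZE TABLE is effectively a full rebuild; this null
-- ALTER does the same thing, rewriting the data in primary-key order:
ALTER TABLE big_table ENGINE=InnoDB;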

Kevin Burton

Peter.

Regarding #23: if you're using code to do a full table scan, try scanning it with an ORDER BY on the primary key.

Since InnoDB is clustered, you'll be able to do sequential reads across the WHOLE table.

Of course, at 150 GB and 100 MB/s (assuming this is the sequential I/O rate of your disk subsystem), it will still take 25 minutes to process the whole table.

Note that this is the theoretical maximum performance you’ll ever be able to achieve.
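In SQL terms (a sketch; hypothetical table with primary key id):

-- Walks the clustered index in key order, keeping reads mostly sequential:
SELECT * FROM big_table ORDER BY id;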

Kevin

Kevin Burton

Peter.

Regarding #22: you're right. Multi-core scalability is a BIG problem that I'd like to see addressed in InnoDB.

BTW, I think MySQL should assume machines will have 8 cores from here on out. They're REALLY cheap. You can get quad-core chips from AMD and Intel now and put two per box.

An 8-core machine with 16 GB of memory is only about $4k now.

Kevin Burton

Sigurd,

Regarding #21

Yes:

"The general rule should be 'keep enough free memory for anything else to run comfortably, and then give the rest to InnoDB.'"

Ideally you wouldn't run anything else, but if you do, just allocate enough memory for it and then give the rest to InnoDB.

Kevin

Kevin Burton

Another question.

Why can't I open my log files with O_DIRECT? I think I'm seeing cache pressure on the Linux VM, which is causing the buffer pool to be pushed into the page file.

Memlock fixes it, of course, but there's no reason to cache the log files, I think.

Kevin

Alexander Rubin

Kevin,

You can tune swappiness at the OS level.
Depending on the VM configuration, Linux will either shrink the buffer cache or swap programs out.

Running "echo 0 > /proc/sys/vm/swappiness" sets the VM strategy to "always shrink the buffer cache". To make the setting persistent, add

vm.swappiness = 0

to /etc/sysctl.conf.

The value is a percentage: set to 100, Linux will always swap programs out and never shrink the buffer cache.

Kevin Burton

Hey Alexander.

I’m aware of the swappiness setting.

In fact, my blog is the #3 Google result for the query "swappiness".

… that said, it doesn't work.

I blogged about it here:

http://feedblog.org/2007/09/29/using-o_direct-on-linux-and-innodb-to-fix-swap-insanity/

Kevin

mike

Hi,

With a 1.5 TB database on 4.1.14 consisting mostly of InnoDB tables, I'm looking at upgrading to 5.1.x
as soon as it becomes a general release. I can't afford much downtime, and I know there is reformatting
involved in moving to the next release.

The manual recommends going from 4.1 to 5.0 and then from 5.0 to 5.1, but that sounds like several reformatting steps.
With a large database, that means lots of downtime.

How can I avoid as much reformatting as possible and still move to 5.1.x? Can you give me the name of
someone you consider an expert on migration so I can discuss this? My email is [email protected]

Thanks

Pavel Francirek

Why does InnoDB start to lose space after some number of queries (usually every few weeks, at around a thousand queries/sec of traffic), such that only a restart of the database helps?

Morgan Tocker

Pavel: if restarting improves things, I would assume it's the InnoDB purge thread lagging. Check your process list and see whether you have a long-running query.

Alexander Rubin

Mike,
If you do a dump and restore, then you do not need to go 4.1 -> 5.0 -> 5.1.
However, a dump/restore of 1.5 TB is very slow. You can try running the dump in parallel across many tables.
For migrating from 4.1 to 5.1, you can also try starting MySQL 5.1/InnoDB on the 4.1 tablespace.

I've done a migration of 1.3 TB from 4.0 to 5.0, and we had to do a dump/restore (too many changes from 4.0 to 4.1). There are actually fewer changes from 4.1 to 5.0.

To minimize downtime you can use a slave server: set up a slave and make sure it is up to date with the master, perform the migration on the slave, wait until the slave catches up again, test the slave, and then move over to it (so the slave becomes your main/master box).
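For the catch-up check in that procedure, something like this (a sketch):

-- On the slave, before switching clients over:
SHOW SLAVE STATUS;
-- Seconds_Behind_Master should be 0, with both the I/O and SQL threads running.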

Alexander Rubin

Question to Heikki

It seems that adaptive hash indexes are either not used for primary key lookups or not shown in "show innodb status".
select * from table where id=

show innodb status:

Hash table size 10624987, used cells 1, node heap has 1 buffer(s)
0.00 hash searches/s, 15979.01 non-hash searches/s

mike

Re: 33

To sharpen the migration question: can a 4.1.14 database be used without problems
against a 5.1.x code base? I really need to avoid as much downtime as possible, even
if it is done with parallel dumps and imports. Several tables have over 500M rows.

Thanks,

Mike

Harrison

Kevin,

Re: 27

The reason you can't use the log files with O_DIRECT is how the log file is written. O_DIRECT has a lot of restrictions on alignment, request sizes, etc. The InnoDB tablespace uses 16 KB pages, which happen to match up perfectly with all of those restrictions. The redo log is not aligned or padded in any way that would allow this.

In theory, you could rework how the log file is written and make it obey all of those rules, but it is certainly much more complex than just adding an O_DIRECT flag, as it might initially sound.

– Harrison

Kevin Burton

Hey Harrison.

Fair enough. In fact, if this is a VM bug, then it would be better to either fix it, disable OS-level paging, or restrict the cache to less than 100% of memory.

The problem I was having is that Linux decides to page even though the whole DB fits in memory and there's plenty of it available.

RHEL actually has patches that allow you to allocate less than 100% of system memory to the cache, which were used by people tuning Oracle on RHEL.

For now, though, memlock actually works fine.

Kevin

Alexander Rubin

Re: 35

Mike,

If you do a dump/restore, there will be no problems with binary incompatibilities (as you will recreate the tables), except for reserved words.

New reserved words were introduced in 5.0 and 5.1:
http://dev.mysql.com/doc/refman/5.0/en/reserved-words.html
http://dev.mysql.com/doc/refman/5.1/en/reserved-words.html

If your table or column names in version 4.1 use reserved words, that can be a problem.
To test, you can dump the schema only from the 4.1 server and restore it on an additional 5.1 server.

Do you want to migrate to MySQL 5.1 (it is not released yet) because of partitioning?

Again, using a slave server will help decrease downtime, as you will be able to write to the master even during the migration process. A 1.5 TB dump/restore will take a long time, maybe a couple of weeks.

Hope this helps.

mike

Re: 38

Hi Alexander,

I can afford maybe 24 hours max. That's why I don't want to
dump/restore.

Reasons to migrate are:

1) Be on the latest version
2) Hopefully get better performance from 5.1.x, due to:
a) O_DIRECT working in 5.1 on Solaris
b) 5.1 eliminating the concurrency problem with auto-increment
c) partitioning
3) My boss wants to

So, no real strong reasons. I can stay on 4.1.14 if I
have to.

The master/slave idea sounds interesting; can you send me
an email so we can discuss it offline?

mike

I know that SELECT COUNT(*) or COUNT(column) on InnoDB is not optimized, because of how InnoDB does row versioning: it does not keep track of the row count. So people suggest using a separate table with table_name/num_rows. However, couldn't this easily be fixed in MySQL? If the table handler is InnoDB, a SELECT COUNT(*) (with no WHERE clause, obviously) could read the row count from INFORMATION_SCHEMA, or maybe from another internal variable. I was just thinking about this earlier today and thought I would pose it as a question.
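For reference, the separate-table workaround people suggest usually looks something like this (a sketch with hypothetical names; triggers require MySQL 5.0+):

CREATE TABLE row_counts (
  table_name VARCHAR(64) NOT NULL PRIMARY KEY,
  num_rows BIGINT NOT NULL
) ENGINE=InnoDB;

CREATE TRIGGER t_count_ins AFTER INSERT ON t FOR EACH ROW
  UPDATE row_counts SET num_rows = num_rows + 1 WHERE table_name = 't';

CREATE TRIGGER t_count_del AFTER DELETE ON t FOR EACH ROW
  UPDATE row_counts SET num_rows = num_rows - 1 WHERE table_name = 't';

-- The "count" then becomes a primary-key lookup instead of a scan:
SELECT num_rows FROM row_counts WHERE table_name = 't';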

Alexander Rubin

re: 41

Mike, the INFORMATION_SCHEMA / SHOW TABLE STATUS information is not 100% correct; it is approximate. And COUNT(*) will give you the exact number of rows in InnoDB (taking other transactions and repeatable read into consideration).

mike

re: 42

There could still be some other way to intercept that and provide a reliable method, without making separate tables and having to add application logic to keep them updated, right?

Alexander Rubin

Re: 42

That is a good question for Heikki, actually.

Khaled

Dear Sir,

Could you please help me find pseudocode for the two-phase locking algorithm or for the intention-locking algorithm?
Thank you

Kunal Jain

I have a quad-core server (4 cores). How can I configure MySQL to use all four cores? Right now the mysql process uses only one core, whose utilization goes up to 95%.

Thanks in advance
Kunal Jain
