September 6, 2010

UDF -vs- MySQL Stored Function

Posted by Aurimas Mikalauskas |

A few days ago I was working on a case where we needed to modify a lot of data before pushing it to Sphinx. MySQL did not have a function to do the job, so I thought I’d write a MySQL stored function and we’d be good to go. It worked! But not so well, really: building the index, which had been taking 10 minutes, was now taking 16 minutes. Then we added another MySQL function for a different set of attributes, and indexing time went from 16 minutes to 26 minutes. I knew using a UDF would be faster, but I had no idea how much. Have you ever wondered? [read more...]

September 2, 2010

How long Innodb Shutdown may take

Posted by peter |

How long may it take MySQL with InnoDB tables to shut down? It can be quite a while.
In the default configuration (innodb_fast_shutdown=ON), the main job InnoDB has to do to complete shutdown is flushing dirty buffers. The number of dirty buffers in the buffer pool varies depending on innodb_max_dirty_pages_pct, as well as on the workload and innodb_log_file_size, and can be anywhere from 10% to 90% in real-life workloads. The Innodb_buffer_pool_pages_dirty status variable will show you the actual number.

The flush speed also depends on a number of factors. The first is your storage configuration: you may be looking at anything from less than 200 writes/sec for a single entry-level hard drive to tens of thousands of writes/sec for a high-end SSD card. Flushing can be done using multiple threads (in XtraDB and the InnoDB Plugin at least), so it scales well with multiple hard drives. The second important variable is your workload, especially how the dirty pages line up on the hard drive. If a lot of sequential pages are dirty, InnoDB will be able to flush them using larger IOs of up to 1MB, which can be a lot faster than flushing data page by page.

So if we have a system with a single hard drive doing 200 IO/sec and a 48GB buffer pool which is 90% dirty, with completely random page writes, we’re looking at 13,500 seconds, or about 5 minutes per 1GB of buffer pool size.
This is the worst-case scenario, though; in practice it is quite common to see a shutdown time of about 1 minute per GB of buffer pool per hard drive.
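
The worst-case arithmetic above can be reproduced with a short sketch. It assumes InnoDB's default 16KB page, uses round decimal units (which is what yields the 13,500 figure), and assumes that every dirty page costs exactly one random write. The function and constant names are illustrative only and are not part of MySQL.

```python
# Assumption: 16KB pages in round decimal units, to match the figures in the text.
PAGE_BYTES = 16_000


def dirty_pages(buffer_pool_gb, dirty_pct, page_bytes=PAGE_BYTES):
    """Approximate number of dirty pages in a buffer pool of the given size."""
    total_pages = buffer_pool_gb * 1_000_000_000 / page_bytes
    return total_pages * dirty_pct / 100


def worst_case_flush_seconds(pages, writes_per_sec):
    """Worst case: fully random writes, so each dirty page costs one write."""
    return pages / writes_per_sec


pages = dirty_pages(buffer_pool_gb=48, dirty_pct=90)
seconds = worst_case_flush_seconds(pages, writes_per_sec=200)
print(f"{pages:,.0f} dirty pages, {seconds:,.0f} s total, "
      f"{seconds / 48 / 60:.1f} min per GB of buffer pool")
```

On a live server, the page count could instead come straight from the Innodb_buffer_pool_pages_dirty status value and be passed to worst_case_flush_seconds(). Sequential dirty pages and multiple drives will bring the real number down.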

Baron has written a nice post on how to decrease InnoDB shutdown time, which you may want to read on this topic.

August 31, 2010

Introducing tcprstat, a TCP response time tool

Posted by Baron Schwartz |

Ignacio Nin and I (mostly Ignacio) have worked together to create tcprstat[1], a new tool that times TCP requests and prints out statistics on them. The output looks somewhat like vmstat or iostat, but we’ve chosen the statistics carefully so you can compute meaningful things about your TCP traffic.

What is this good for? In a nutshell, it is a lightweight way to measure response times on a server such as a database, memcached, Apache, and so on. You can use this information for historical metrics, capacity planning, troubleshooting, and monitoring, to name just a few.

[read more...]

August 23, 2010

InnoDB memory allocation, ulimit, and OpenSUSE

Posted by Sasha Pachev |

I recently encountered an interesting case. A customer reported that mysqld crashed on start on OpenSUSE 11.2 (kernel 2.6.31.12-0.2-desktop, x86_64) with 96 GB RAM when innodb_buffer_pool_size was set to anything more than 62 GB. I decided to try it with 76 GB. The error message was an assert due to a failed malloc() in ut_malloc_low() in ut/ut0mem.c inside the InnoDB source code. InnoDB wraps the majority of its memory allocations in ut_malloc_low(), so to get an idea of the pattern of requested allocations I added a debugging fprintf() to tell me how much was being allocated and whether it was successful.

I discovered something interesting. I expected the allocation of the 76 GB buffer pool to fail, due to some weird memory mapping issue and a contiguous block of 76 GB not being available. However, that is not what happened. The 76 GB buffer was allocated successfully. What was failing was the allocation of 3.37 GB after that. What in the world could InnoDB need that was 3.37 GB? There was nothing in the settings that asked for anything close to 3 GB explicitly.

Source code is the ultimate documentation, and I took advantage of that. My good friend GDB guided me to buf_pool_init() in buf/buf0buf.c. There I found the following:
[read more...]

August 19, 2010

High availability for MySQL on Amazon EC2 – Part 4 – The instance restart script

Posted by Yves Trudeau |

This post is the fourth of a series that started here.

From the previous post of this series, we now have resources configured, but instead of starting MySQL, Pacemaker invokes a script to start (or restart) the EC2 instance running MySQL. This blog post describes the instance restart script. Remember, I am more of a DBA than a script writer, so it might not be written in the most optimal way.

[read more...]

Percona talks at OpenSQL Camp this weekend

Posted by Morgan Tocker |

Four Perconians (perconites?) will be at OpenSQL Camp in Sankt Augustin, Germany this weekend presenting talks on:

  • Recovery of Lost or Corrupted InnoDB Tables
  • Keep your MySQL backend online no matter what
  • XtraDB — InnoDB on steroids
  • Xtrabackup for MySQL

If you would like to stop by and say hello, we are Aleksandr, Istvan, Morgan and Aurimas (pictures here).

If you can make the (approximate) location, but not the date, we also have training in Frankfurt in three weeks’ time.

August 18, 2010

Announcing Training for Operations Teams

Posted by Morgan Tocker |

We’re opening up registration for our new training courses today.  In short: we are moving from two days to a new four-day format.  The new additions are created by:

  • Splitting our current InnoDB day in half. We now have one day for DBAs, and one day just on InnoDB topics.
  • A new Operations Day – covering how to maintain MySQL in production.

Our developer course has also undergone revision, and we now have more query tuning examples, and a new instrumentation chapter.

What is operations training?

Many companies split their technical staff between development and operations.  The operations team is responsible for tasks such as capacity planning, backup/disaster recovery, and carrying a pager. They are the heroes that fight fires.

Our operations day of training is delivered as a hands-on class.  Attendees will be divided into teams and given a series of challenges to complete on EC2 machines.  As part of the development of this course we wrote a sample LAMP application, embedded with minor flaws which students will need to fix while they try to keep the application up.

Where can you attend?

We’re starting off by branding the operations day as a ‘BETA’.  You can attend for only $100 in San Francisco (Thursday, 30 Sep 2010), New York (Thursday, 14 Oct 2010), or Vancouver (Thursday, 21 Oct 2010). There is also a discount for attending all four days: $1,450 if you book before 30 August.

After our initial BETA, the courses will be available in more USA and international locations. A partial list is already available on the training section of our website. We will confirm more cities in the coming weeks.

August 16, 2010

Testing MySQL column stores

Posted by Justin Swanhart |

Recently I had the opportunity to do some testing on a large data set against two MySQL column-store storage engines. I’d like to note that this effort was sponsored by Infobright, but this analysis reflects my independent testing from an objective viewpoint.

I performed two different types of testing. The first focused on core functionality and compatibility of ICE (Infobright Community Edition) compared with MyISAM on a small data set. The second part of my testing compared the performance and accuracy of ICE with InfiniDB Community Edition on a 950GB data set.
[read more...]

August 10, 2010

Why message queues and offline processing are so important

Posted by Morgan Tocker |

If you read Percona’s whitepaper on Goal-Driven Performance Optimization, you will notice that we define performance using the combination of three separate terms. You really want to read the paper, but let me summarize it here:

  1. Response Time – This is the time required to complete a desired task.
  2. Throughput – Throughput is measured in tasks completed per unit of time.
  3. Capacity – The system’s capacity is the point where load cannot be increased without degrading response time below acceptable levels. [read more...]

July 31, 2010

Why you can’t rely on a replica for disaster recovery

Posted by Baron Schwartz |

A couple of weeks ago one of my colleagues and I worked on a data corruption case that reminded me that sometimes people make unsafe assumptions without knowing it. This one involved SAN snapshotting that was unsafe.

In a nutshell, the client used SAN block-level replication to maintain a standby/failover MySQL system, and there was a failover that didn’t work; both the primary and fallback machine had identically corrupted data files. After running fsck on the replica, the InnoDB data files were entirely deleted.

When we arrived on the scene, there was a data directory with an 800+ GB data file, which we determined had been restored from a SAN snapshot. Accessing this file caused a number of errors, including warnings about accessing data outside of the partition boundaries. We were eventually able to coax the filesystem into truncating the data file back to a size that didn’t contain invalid pointers and could be read without errors on the filesystem level. From InnoDB’s point of view, though, it was still completely corrupted. The “InnoDB file” contained blocks of data that were obviously from other files, such as Python exception logs. The SAN snapshot was useless for practical purposes. (The client decided not to try to extract the data from the corrupted file, which we have specialized tools for doing. It’s an intensive process that costs a little money.)

The problem was that the filesystem was ext2, with no journaling and no consistency guarantees. A snapshot on the SAN is just the same as cutting the power to the machine: the block device is in an inconsistent state. A filesystem that can survive that has to ensure that it writes the data to the block device such that it can bring it into a consistent state later. The techniques for doing this include things like ordered writes and meta-data journaling. But ext2 does not know how to do that. The data that’s seen by the SAN is some jumble of blocks that represents the most efficient way to transfer the changed blocks over the interconnect, without regard to logical consistency on the filesystem level.

Two things can help avoid such a disaster: 1) get qualified advice and 2) don’t trust the advice; backups and disaster recovery plans must be tested periodically.

This case illustrates an important point that I repeat often. The danger of using a replica as a backup is that data loss on the primary can affect the replica, too. This is true no matter what type of replication is being used. In this case it’s block-level SAN replication. DRBD would behave just the same way. At a higher level, MySQL replication has the same weakness. If you rely on a MySQL slave for a “backup,” you’ll be out of luck when someone accidentally runs DROP TABLE on your master. That statement will promptly replicate over and drop the table off your “backup.”
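
To make the difference concrete, here is a toy model and not real MySQL replication. ToyServer, replicate(), and the "orders" table are hypothetical names invented for this sketch. A replica replays every statement, including the destructive one, while a backup is a point-in-time copy that later statements cannot touch.

```python
import copy


class ToyServer:
    """A stand-in for a database server: just a dict of table name -> rows."""

    def __init__(self):
        self.tables = {}

    def apply(self, statement, table, rows=None):
        if statement == "CREATE":
            self.tables[table] = list(rows or [])
        elif statement == "DROP":
            self.tables.pop(table, None)


def replicate(statement, table, rows, primary, replica):
    """Apply a statement on the primary, then replay it on the replica."""
    primary.apply(statement, table, rows)
    replica.apply(statement, table, rows)


primary, replica = ToyServer(), ToyServer()
replicate("CREATE", "orders", [1, 2, 3], primary, replica)

# A backup is a point-in-time copy, detached from the replication stream.
backup = copy.deepcopy(primary.tables)

# Someone accidentally drops the table on the primary.
replicate("DROP", "orders", None, primary, replica)

print("orders" in primary.tables)  # False
print("orders" in replica.tables)  # False: the mistake replicated
print("orders" in backup)          # True: the backup still has the data
```

The same reasoning applies whatever the replication layer is. Anything that faithfully copies changes will faithfully copy the damaging ones as well.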

I still see people using a replica as a backup, and I know it’s just a matter of time before they lose data. In my experience, the types of errors that will propagate through replication are much more common than those that’ll be isolated to just one machine, such as hardware failures.