July 24, 2010

Caching could be the last thing you want to do

Posted by Morgan Tocker |

I recently had a run-in with very popular PHP ecommerce package which makes me want to voice a recurring mistake I see in how many web applications are architected.

What is that mistake?

The ecommerce package I was working with depended on caching.  Out of the box it couldn’t serve 10 pages/second unless I enabled some features which were designed to be “optional” (but clearly they weren’t).
[read more...]

July 20, 2010

Estimating Replication Capacity

Posted by peter |

It is easy for MySQL replication to become bottleneck when Master server is not seriously loaded and the more cores and hard drives the get the larger the difference becomes, as long as replication
remains single thread process. At the same time it is a lot easier to optimize your system when your replication runs normally – if you need to add/remove indexes and do other schema changes you probably would be looking at some methods involving replication if you can’t take your system down. So here comes the catch in many systems – we find system is in need for optimization when replication can’t catch up but yet optimization process we’re going to use relays on replication being functional and being able to catch up quickly.

So the question becomes how can we estimate replication capacity, so we can deal with replication load before slave is unable to catch up.
[read more...]

July 17, 2010

SSD: Free space and write performance

Posted by Vadim |

( cross posting from SSD Performance Blog )
In previous post On Benchmarks on SSD, commenter touched another interesting point. Available free space affects write performance on SSD card significantly. The reason is still garbage collector, which operates more efficiently the more free space you have. Again, to read mode on garbage collector and write problem you can check Write amplification wiki page.

To see how performance drops with decreasing free space, let’s run sysbench fileio random write benchmark with different file sizes.

For test I took FusionIO 320 GB SLC PCIe DUO™ ioDrive card, with software stripping between two cards, and there if graph how throughput depends on available free space ( the bigger file – the less free space)

The system specification and used scripts you can see on Benchmark Wiki

On graph you can see two line ( yes, there are two lines, even they are almost identical).
First line is result when FusionIO is formatted to use full capacity, and second line is for case when I use additional space reservation ( 25% in this case, that is 240GB available). There is no difference in this case, however additional over-provisioning protects you from overusing space, and keeps performance on corresponding level.

It is clear the maximal throughput strongly depends on available free space.
With 100GiB utilization we have 933.60 MiB/sec,
with 150GiB (half of capacity) 613.48 MiB/sec and
with 200GiB it drops to 354.37 MiB/sec, which is 2.6x times less comparing with 100GiB.

So returning to question how to run proper benchmark, the result significantly depends what percentage of space on card is used, the results for 100GiB file on 160 GB card, will be different from the results for 100GiB file on 320 GB card.

Beside free space, the performance also depends on garbage collector algorithm by itself, and the card from different manufactures will show different results. Some new coming cards make high performance in case with high space utilization as competitive advantage, and I am going to run the same analysis on different cards.

July 16, 2010

Analyzing the distribution of InnoDB log file writes

Posted by Baron Schwartz |

I recently did a quick analysis of the distribution of writes to InnoDB’s log files. On a high-traffic commodity MySQL server running Percona XtraDB for a gaming workload (mostly inserts to the “moves” table), I used strace to gather statistics about how the log file writes are distributed in terms of write size. InnoDB writes to the log files in multiples of 512 bytes. Mark Callaghan explained this and some of its performance implications here. How many big writes does InnoDB do, and how many small writes?

[read more...]

July 15, 2010

Data mart or data warehouse?

Posted by Justin Swanhart |

This is part two in my six part series on business intelligence, with a focus on OLAP analysis.

  1. Part 1 – Intro to OLAP
  2. Identifying the differences between a data warehouse and a data mart. (this post)
  3. Introduction to MDX and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

What is a data warehouse?
It turns out that this question is a little more difficult to answer than it probably should be. This is because data warehousing has become an overloaded term that includes BI tools (OLAP/data mining), data extraction and transformation tools (ETL), and schema management tools.
[read more...]

July 14, 2010

On Benchmarks on SSD

Posted by Vadim |

(cross post from SSD Performance Blog )
To get meaningful performance results on SSD storage is not easy task, let’s see why.
There is graph from sysbench fileio random write benchmark with 4 threads. The results were taken on PCI-E SSD card ( I do not want to name vendor here, as the problem is the same for any card).

The benchmark starts on the newly formatted card, and some period (fresh period A) you see line with high result, which at some point of time drops (point B) and after some recovery period there is steady state ( state C ).

What happens there, as you may know, SSD has garbage collector activity, and the point B is time when garbage collector starts its work. You can read more on this topic on
Write amplification wiki page.

So as you understand it is important to know, what the state the card was in, when the benchmark was running. Apparently, many manufactures like to put in the specification of device the result from fresh period A, while I think steady state C is more important for end users. So in my further results I will point what was the state of the card during benchmark.

However it makes task of running benchmark on SSD trickier. It is similar to benchmarks on database but up-down. The database just after start is in “cold state” and you need to make sure you have enough warmup and only take results in the hot state, when all internal buffers are filled and populated.
Well, you may say – just to put card in steady state C and run the benchmark, but it is only part of the problem.

The next issue comes from TRIM command. TRIM command is the command sent to device when the file is deleted, and it allows for SSD controller to mark all space related to file as free and reuse it immediately. Not all devices support TRIM command, for example the first generation of Intel SSD cards did not support it, while G2 does.
So why TRIM is the problem for the benchmark – basically if you delete all files, it returns the card to fresh state A. The many benchmark scenarios ( and my initial sysbench fileio scripts) suppose to create files at the start of benchmark and remove afterward. The similar issue is when you restore database from backup, run benchmark, and remove files. That it may happen during your run you cover all states A->B->C, and the final result is pretty much useless. So as the conclusion if you want to see the result from steady state you should make sure you have it in your benchmark.

As we speak about benchmark results, there is another trick from vendors, I want to put your attention. Quite often you can see in specification from imaginary Vendor X say:

  • Read: Up to 520 MB/s
  • Write: Up to 480 MB/s
  • Sustained Write: Up to 420 MB/s
  • Random Write 4KB : 70,000 IOPS

The good thing there is that vendor put both maximal write ( most likely from state A) and Sustained Write ( I guess from state C).
However if you multiply 4KB*70000IOS, you will get 280000KB/s = 274MB/s, which is quite far from
declared 520MB/s.
What is the trick there: the trick is that maximal throughput in MB/sec you are getting when you use big block size, say 64K or 128K, and maximal throughput in IOPS you are getting when you use small block size, 4K in this case.

So when you read Write: Up to 480 MB/s, Random Write 4KB : 70,000 IOPS, you should know that 480MB/s was received with big block size, and for 4KB block size you should expect only 274MB/s ( and most likely in fresh state A).

As SSD market involving, we will see more and more the benchmark results, so be ready to read it carefully.

SLC vs MLC

Posted by Vadim |

(cross posting from SSDPeformanceBlog.com )
All modern solid state drives use NAND memory based on SLC (single level cell) or MLC (multi level cell) technologies.
Not going into physical details – SLC basically stores 1 bit of information, while MLC can do more. Most popular option for MLC is 2 bit, and there is movement into 3 bit direction.

This fact gives us next characteristics:

  • SLC provides less capacity
  • SLC is more expensive
  • SLC is know to have better quality cheap

Along with that there is also limitation on amount of write operations. SLC can handle about 100,000 write cycles, while MLC is 10,000 ( the numbers are rough, and changing with technology improvement)

No wonder that vendor very quickly come with next separation:

  • SLC for enterprise market ( servers )
  • MLC for consumer market ( desktops, workstations, laptops)

As obvious example here is Intel SSD cards: X25-E ( SLC) is sold as enterprise level card, and X25-M ( MLC ) is sold for mass market. As another example of difference in capacity and price:

  • FusionIO 160GB SLC card price $6,308.99
  • FusionIO 320GB MLC card price $6,729.99

That is for the same price MLC card comes with doubled capacity.

However with increasing capacity difference between MLC and SLC is getting fuzzier. For MLC most critical part is software (firmware) algorithm which ensures a uniform usage of available NAND chips, and with bigger capacity it is much easier to implement.
This problem with handling lifetime and manage write cycle for MLC opened way for hardware solution like SandForce controller and recently Anabit announced “Memory Signal Processing (MSP™) technology enables MLC-based solutions at SLC-grade reliability and performance”.

Also important is increasing capacity for MLC devices, for example, if we take 10,000 writes vs 100,000 writes than to provide the same life time MLC would need about 10x more capacity, and
it seems not problem. I expect soon we will see MLC cards with 1600GB, which ideally will have the same lifetime as SLC 160GB cards.

On this way interesting to see Intel announces enterprise line for SSD card will be based on
eMLC
( enterprise MLC ), where each cell has 30,000 writes lifetime and with maximal capacity 400GB

So it seems market is gradually moving into “MLC is ready for enterprise” direction, and sounds as good option to have devices with high capacity and reasonable price in near future.

Some articles on this topics:

July 13, 2010

Scaling: Consider both Size and Load

Posted by peter |

So lets imagine you have the server handling 100.000 user accounts. You can see the CPU,IO and Network usage is below 10% of capacity – does it mean you can count on server being able to
handle 1.000.000 of accounts ? Not really, and there are few reasons why, I’ll name most important of them:

Contention – This is probably the most obvious one. MySQL (and systems in general) do not scale perfectly with numbers of CPUs and number of concurrent requests. Reduced efficiency of CPU cache, Mutex contention and database lock contention all come here. Some of them are preventable and can be reduced by code changes, such as there have been a lot of advanced in scalability of MySQL by improving locking code design, others, such as row level locks would require application changes to allow more concurrent process. The scalability numbers depend a lot on the system scale, software and workload.
[read more...]

Percona at OSCON 2010

Posted by peter |

This year we’re participating in OSCON as a Sponsor and organizing some BOFs. I will be on the conference 21 and 22 if you’re interested to chat.
Here is the list of currently scheduled BOFs which I’ll be hosting:
Running Databases on Flash Storage
Sphinx Search 2010
XtraDB, XtraBackup, Maatkit, Percona Server
See you there.

July 12, 2010

Intro to OLAP

Posted by Justin Swanhart |

This is the first of a series of posts about business intelligence tools, particularly OLAP (or online analytical processing) tools using MySQL and other free open source software. OLAP tools are a part of the larger topic of business intelligence, a topic that has not had a lot of coverage on MPB. Because of this, I am going to start out talking about these topics in general, rather than getting right to gritty details of their performance.

I plan on covering the following topics:

  1. Introduction to OLAP and business intelligence. (this post)
  2. Identifying the differences between a data warehouse, and a data mart.
  3. Introduction to MDX queries and the kind of SQL which a ROLAP tool must generate to answer those queries.
  4. Performance challenges with larger databases, and some ways to help performance using aggregation.
  5. Using materialized views to automate that aggregation process.
  6. Comparing the performance of OLAP with and without aggregation over multiple MySQL storage engines at various data scales.

[read more...]