<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Disaster:  LVM Performance in Snapshot Mode</title>
	<atom:link href="http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/</link>
	<description>Everything about MySQL Performance</description>
	<lastBuildDate>Sat, 21 Nov 2009 05:23:57 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Timothy Denike</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-625943</link>
		<dc:creator>Timothy Denike</dc:creator>
		<pubDate>Fri, 07 Aug 2009 21:58:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-625943</guid>
		<description>I found similar benchmark results (6:1) under a single-threaded test, but the random write results approached 2:1 as I increased thread concurrency to 10.  (Presumably because the read on the COW scales with concurrency, while writes do not.)  Seems to me the penalty could be the single-threaded read.  Is this a valid test to profile InnoDB load, or are all InnoDB writes going to be serialized through a single thread?</description>
		<content:encoded><![CDATA[<p>I found similar benchmark results (6:1) under a single-threaded test, but the random write results approached 2:1 as I increased thread concurrency to 10.  (Presumably because the read on the COW scales with concurrency, while writes do not.)  Seems to me the penalty could be the single-threaded read.  Is this a valid test to profile InnoDB load, or are all InnoDB writes going to be serialized through a single thread?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: René Leonhardt</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-504914</link>
		<dc:creator>René Leonhardt</dc:creator>
		<pubDate>Fri, 13 Mar 2009 17:32:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-504914</guid>
		<description>@Peter Vajgel:
I am wondering how Ext4 would compete with XFS: Extents and multiblock + delayed allocation.
http://kernelnewbies.org/Ext4</description>
		<content:encoded><![CDATA[<p>@Peter Vajgel:<br />
I am wondering how Ext4 would compete with XFS: Extents and multiblock + delayed allocation.<br />
<a href="http://kernelnewbies.org/Ext4" rel="nofollow">http://kernelnewbies.org/Ext4</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-479963</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Tue, 17 Feb 2009 05:53:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-479963</guid>
		<description>Rilson,

In this case we&#039;re just using one snapshot (which is enough for database backup) and it does not work.

I have not tried Zumastor - if you can rerun my benchmarks and post results that would be quite interesting.</description>
		<content:encoded><![CDATA[<p>Rilson,</p>
<p>In this case we&#8217;re just using one snapshot (which is enough for database backup) and it does not work.</p>
<p>I have not tried Zumastor &#8211; if you can rerun my benchmarks and post results that would be quite interesting.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rilson Nascimento</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-479794</link>
		<dc:creator>Rilson Nascimento</dc:creator>
		<pubDate>Tue, 17 Feb 2009 02:57:08 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-479794</guid>
		<description>Hi Peter,

Have you tried Zumastor snapshot and replication tools? 

Zumastor is an enterprise storage server for Linux. It keeps all snapshots for a particular volume in a common snapshot store, and shares blocks the way one would expect. Thus making a change to one block in a file in the original volume only uses one block in the snapshot store no matter how many snapshots you have (verbatim from Zumastor how-to).

LVM snapshots design has the surprising property that every block you change on the original volume consumes one block for each snapshot. The resulting speed and space penalty makes the use of more than 1 or 2 snapshots at a time impractical (verbatim from Zumastor how-to page as well).</description>
		<content:encoded><![CDATA[<p>Hi Peter,</p>
<p>Have you tried Zumastor snapshot and replication tools? </p>
<p>Zumastor is an enterprise storage server for Linux. It keeps all snapshots for a particular volume in a common snapshot store, and shares blocks the way one would expect. Thus making a change to one block in a file in the original volume only uses one block in the snapshot store no matter how many snapshots you have (verbatim from Zumastor how-to).</p>
<p>LVM snapshots design has the surprising property that every block you change on the original volume consumes one block for each snapshot. The resulting speed and space penalty makes the use of more than 1 or 2 snapshots at a time impractical (verbatim from Zumastor how-to page as well).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-477682</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Sun, 15 Feb 2009 06:35:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-477682</guid>
		<description>Brian, 

Yeah I will try it again when I have the chance.... also will see with XFS.

The problem is that, based on the discussion, we&#039;re still looking at 3X the IOs (one read and 2 writes), which is already too large.</description>
		<content:encoded><![CDATA[<p>Brian, </p>
<p>Yeah I will try it again when I have the chance&#8230;. also will see with XFS.</p>
<p>The problem is that, based on the discussion, we&#8217;re still looking at 3X the IOs (one read and 2 writes), which is already too large.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-477674</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Sun, 15 Feb 2009 06:31:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-477674</guid>
		<description>Kostas,

In this case it was the default. If you can suggest some parameters that should help performance, let me know and we will try them.</description>
		<content:encoded><![CDATA[<p>Kostas,</p>
<p>In this case it was the default. If you can suggest some parameters that should help performance, let me know and we will try them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Brian Sneddon</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-474201</link>
		<dc:creator>Brian Sneddon</dc:creator>
		<pubDate>Fri, 13 Feb 2009 00:49:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-474201</guid>
		<description>If you bypass lvcreate and just configure the snapshot in the device-mapper manually you can designate the device as persistent or not.  If it&#039;s not persistent then it won&#039;t write out metadata to the disk... if that&#039;s where the bottleneck is in the code then that may reveal it.  It may be possible to do it through lvcreate, but I&#039;ve just never tried it myself.</description>
		<content:encoded><![CDATA[<p>If you bypass lvcreate and just configure the snapshot in the device-mapper manually you can designate the device as persistent or not.  If it&#8217;s not persistent then it won&#8217;t write out metadata to the disk&#8230; if that&#8217;s where the bottleneck is in the code then that may reveal it.  It may be possible to do it through lvcreate, but I&#8217;ve just never tried it myself.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kostas Georgiou</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-473911</link>
		<dc:creator>Kostas Georgiou</dc:creator>
		<pubDate>Thu, 12 Feb 2009 21:44:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-473911</guid>
		<description>What was the chunksize used for the snapshot? A small one (4k?) probably makes more sense for a random IO pattern.</description>
		<content:encoded><![CDATA[<p>What was the chunksize used for the snapshot? A small one (4k?) probably makes more sense for a random IO pattern.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-470639</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Tue, 10 Feb 2009 02:23:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-470639</guid>
		<description>Peter,

Thanks for posting results.</description>
		<content:encoded><![CDATA[<p>Peter,</p>
<p>Thanks for posting results.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter Vajgel</title>
		<link>http://www.mysqlperformanceblog.com/2009/02/05/disaster-lvm-performance-in-snapshot-mode/comment-page-1/#comment-470620</link>
		<dc:creator>Peter Vajgel</dc:creator>
		<pubDate>Mon, 09 Feb 2009 22:42:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=610#comment-470620</guid>
		<description>There are basically two snapshot technologies - copy-on-write (COW) and pointer-based block-map technologies of the log structured file systems (WAFL, ZFS). With log structured filesystems you get snapshots practically for free but there are other disadvantages like fragmentation. The traditional filesystems use COW technology on various levels - volume manager (LVM, VXVM) or in the filesystem (VxFS). If you use LVM snapshots the choice of the filesystem can be quite important. I ran the &quot;randomio&quot; benchmark on an ext3 filesystem and compared it with xfs. The difference with an LVM snapshot is big. &quot;randomio&quot; opens its target with O_DIRECT and does random io with multiple threads. The write percentage controls the ratio of reads to writes done by each thread. My setup was 6 SAS 10K drives in a RAID1+0 configuration with a 512MB write-back cache in the controller. First the raw numbers - 100 threads, 20% writes, 4K iosize. The iops columns report I/O operations per second.

[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /dev/mapper/VolGroup20-mysqldata 100 0.2 0 4096 60
  total &#124;  read:         latency (ms)       &#124;  write:        latency (ms)
   iops &#124;   iops   min    avg    max   sdev &#124;   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
 1793.9 &#124; 1433.7   0.1   69.6 2853.9   82.4 &#124;  360.2   0.1    0.1    4.2    0.1
 1675.6 &#124; 1340.0   0.1   74.6 1596.9   88.0 &#124;  335.6   0.1    0.1    2.2    0.0

[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
  Logical volume &quot;snap&quot; created

 1379.3 &#124; 1104.6   0.1   68.7 1560.2   86.1 &#124;  274.7   0.1   87.0 1555.4  178.1
 1061.3 &#124;  848.4   0.1   42.7  544.8   43.8 &#124;  212.9   0.1  300.9  865.7  154.5
 1082.1 &#124;  863.4   0.1   43.2  572.5   45.5 &#124;  218.7   0.1  286.7  950.5  145.4
 1077.3 &#124;  863.8   0.1   43.0  771.1   45.0 &#124;  213.5   0.1  292.6  795.3  147.4

So that&#039;s our base. Let&#039;s try the same with xfs on a large file -

[root@udbqa006.sf2p ~]# mount
/dev/mapper/VolGroup20-mysqldata on /data type xfs (rw,noatime,allocsize=1g)

[root@udbqa006.sf2p ~]# ls -l /data
total 335544320
-rw-r--r--  1 root  root  343597383680 Sep  9 16:05 foo

[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# xfs_bmap -l /data/foo
/data/foo:
        0: [0..173950911]: 320..173951231 173950912 blocks
        1: [173950912..341722991]: 183500864..351272943 167772080 blocks
        2: [341722992..509495071]: 367263808..535035887 167772080 blocks
        3: [509495072..671088639]: 550502464..712096031 161593568 blocks

[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60
  total &#124;  read:         latency (ms)       &#124;  write:        latency (ms)
   iops &#124;   iops   min    avg    max   sdev &#124;   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
 1887.2 &#124; 1508.4   0.1   66.2 1694.0   77.1 &#124;  378.8   0.1    0.1    2.9    0.1
 1678.6 &#124; 1341.6   0.1   74.4 1584.6   87.0 &#124;  337.0   0.1    0.1    1.1    0.0
 1670.9 &#124; 1334.9   0.1   75.0 1567.0   88.0 &#124;  336.0   0.1    0.1    1.0    0.0

[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
  Logical volume &quot;snap&quot; created

 1592.2 &#124; 1271.3   0.1   74.7 1401.9   89.3 &#124;  320.8   0.1   14.7 1801.1   95.8
 1073.0 &#124;  859.6   0.1   43.5  682.1   44.5 &#124;  213.4   0.1  294.2  796.2  150.9
 1090.7 &#124;  872.6   0.1   42.9  593.9   44.0 &#124;  218.1   0.1  287.3  717.1  146.0

Good - xfs matches raw - with or without snapshot. Now let&#039;s look at ext3 with the same size file -

[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60
  total &#124;  read:         latency (ms)       &#124;  write:        latency (ms)
   iops &#124;   iops   min    avg    max   sdev &#124;   iops   min    avg    max   sdev
--------+-----------------------------------+----------------------------------
  149.2 &#124;  119.3   4.8  668.1 1042.5   54.1 &#124;   30.0   5.2  660.9 1048.3   64.6
  164.8 &#124;  131.8   1.8  608.0  740.7   43.6 &#124;   32.9 512.3  603.9  734.5   36.5
  181.4 &#124;  145.6   8.0  552.3  687.0   41.5 &#124;   35.8 419.2  547.1  671.7   36.6
  197.6 &#124;  157.7   6.8  507.8 1118.2   67.9 &#124;   39.9   0.1  498.9 1078.5   67.3
  224.3 &#124;  180.3   4.5  448.2  647.6   45.3 &#124;   43.9   0.1  439.5  623.6   47.1
  256.9 &#124;  204.2   5.1  390.9  521.1   42.9 &#124;   52.7   0.2  383.4  505.7   43.1
  295.4 &#124;  236.6   3.0  340.5  474.3   44.4 &#124;   58.9   0.1  330.5  461.9   46.4
  350.3 &#124;  279.4   5.0  287.5  419.8   43.2 &#124;   70.9   0.1  277.6  401.6   43.8
  421.5 &#124;  336.8   1.6  239.7  414.8   44.0 &#124;   84.7   0.1  228.2  394.2   45.6
  534.2 &#124;  426.8   3.1  190.0  348.7   41.2 &#124;  107.4   0.1  176.1  310.1   40.5
  687.7 &#124;  550.7   2.1  148.4  349.6   39.1 &#124;  137.0   0.1  133.4  282.8   38.9
  941.7 &#124;  752.9   1.9  111.2  366.5   43.3 &#124;  188.7   0.1   86.5  243.8   41.3
 1258.9 &#124; 1005.9   1.6   87.0  603.9   54.9 &#124;  253.0   0.1   49.3  388.4   45.7
 1498.6 &#124; 1200.7   0.1   77.3 1056.2   72.1 &#124;  297.9   0.1   24.1  559.9   46.9
 1645.6 &#124; 1313.2   0.1   73.6 1323.9   83.3 &#124;  332.4   0.1   10.0  941.0   39.7
 1692.2 &#124; 1354.6   0.1   73.0 1436.4   86.1 &#124;  337.6   0.1    3.2  640.0   26.9
 1721.1 &#124; 1377.5   0.1   72.3 1921.1   85.1 &#124;  343.6   0.1    1.2  568.4   18.9
 1721.9 &#124; 1379.7   0.1   72.4 1562.9   83.7 &#124;  342.2   0.1    0.2   69.7    2.4
 1705.9 &#124; 1364.8   0.1   73.3 1559.8   85.1 &#124;  341.1   0.1    0.1    0.7    0.0
 1693.7 &#124; 1357.2   0.1   73.6 1480.5   87.1 &#124;  336.5   0.1    0.1    2.1    0.0
 1696.2 &#124; 1355.7   0.1   73.8 1520.0   86.4 &#124;  340.4   0.1    0.1    1.5    0.0

[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata
  Logical volume &quot;snap&quot; created

  626.7 &#124;  502.0   0.1  168.7 2581.4  192.8 &#124;  124.7   0.1  121.5 2471.1  203.7
  310.3 &#124;  248.7   6.2  321.2  529.4   61.9 &#124;   61.6   7.0  326.0  518.9   62.9
  305.5 &#124;  244.0   2.1  326.1  595.8   61.2 &#124;   61.5  10.0  331.3  579.2   61.1
  313.3 &#124;  251.2   4.9  319.0  503.5   60.2 &#124;   62.1  12.8  322.3  496.6   59.9
  300.5 &#124;  239.7   4.1  331.4 1207.4   81.3 &#124;   60.7  10.0  335.7 1201.1   83.5
  312.7 &#124;  250.8   4.1  318.6  587.2   65.7 &#124;   61.8  10.0  324.8  574.1   67.0
  309.3 &#124;  247.5   3.3  322.5  540.3   58.9 &#124;   61.8   9.0  326.3  517.9   57.3

What&#039;s going on? I&#039;ve glanced over the code real quick and this is my speculation. There are 2 issues with ext3. The first one is that it is not an extent based filesystem. It is block based. What that means is that there are lots of indirect address blocks addressing your data blocks and they don&#039;t have to be all cached if they are not accessed frequently enough. So a read (or a write) might require a disk read in order to map your logical offset to the actual physical offset in the file. That means extra io&#039;s before you even get to read/write your actual data. xfs (and vxfs and zfs) are extent based filesystems and the block maps of large files can fit into an inode itself if the filesystem is not severely fragmented so mapping your offset is very fast (you can preallocate space in vxfs and in xfs - I use a mount option allocsize=1g).

The second problem is the level of parallelism a filesystem allows when it comes to competing reads/writes. I believe that ext3 is taking an exclusive rwlock on each write and doesn&#039;t release it through the whole io. On the other hand xfs (unless you are growing the file) takes and holds the lock in a shared mode if the file is opened in O_DIRECT mode. This can have big consequences.

Look at the slow start of ext3 - it took nearly 15 minutes to reach raw speeds. Why is that? Look at the write latencies - at the beginning they are big - which we would not expect with a write-back cache. I believe that they are so big because the indirect address blocks were not in the cache. So what should be a very fast write under normal circumstances (write-back cache) turns into a read and a write while holding the exclusive rwlock - blocking all the readers at the same time. Eventually (in 15 mins or so) all of the indirect address blocks are cached and ext3 returns to &quot;raw&quot; performance.

But once the snapshot is created it plummets again. The reason is the same - a write that turns into a COW read plus a write holds the exclusive rwlock for much longer than a write without a COW operation, so it chokes the readers once again. A bigger write ratio has an even more drastic effect on ext3 + snapshot performance.

&quot;randomio&quot; might not be a good representation of what MySQL is doing, so we are currently testing xfs with MySQL. But if my speculation is valid, I would expect a smaller degradation with xfs.</description>
		<content:encoded><![CDATA[<p>There are basically two snapshot technologies &#8211; copy-on-write (COW) and pointer-based block-map technologies of the log structured file systems (WAFL, ZFS). With log structured filesystems you get snapshots practically for free but there are other disadvantages like fragmentation. The traditional filesystems use COW technology on various levels &#8211; volume manager (LVM, VXVM) or in the filesystem (VxFS). If you use LVM snapshots the choice of the filesystem can be quite important. I ran the &#8220;randomio&#8221; benchmark on an ext3 filesystem and compared it with xfs. The difference with an LVM snapshot is big. &#8220;randomio&#8221; opens its target with O_DIRECT and does random io with multiple threads. The write percentage controls the ratio of reads to writes done by each thread. My setup was 6 SAS 10K drives in a RAID1+0 configuration with a 512MB write-back cache in the controller. First the raw numbers &#8211; 100 threads, 20% writes, 4K iosize. The iops columns report I/O operations per second.</p>
<p>[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /dev/mapper/VolGroup20-mysqldata 100 0.2 0 4096 60<br />
  total |  read:         latency (ms)       |  write:        latency (ms)<br />
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev<br />
--------+-----------------------------------+----------------------------------<br />
 1793.9 | 1433.7   0.1   69.6 2853.9   82.4 |  360.2   0.1    0.1    4.2    0.1<br />
 1675.6 | 1340.0   0.1   74.6 1596.9   88.0 |  335.6   0.1    0.1    2.2    0.0</p>
<p>[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata<br />
  Logical volume &#8220;snap&#8221; created</p>
<p> 1379.3 | 1104.6   0.1   68.7 1560.2   86.1 |  274.7   0.1   87.0 1555.4  178.1<br />
 1061.3 |  848.4   0.1   42.7  544.8   43.8 |  212.9   0.1  300.9  865.7  154.5<br />
 1082.1 |  863.4   0.1   43.2  572.5   45.5 |  218.7   0.1  286.7  950.5  145.4<br />
 1077.3 |  863.8   0.1   43.0  771.1   45.0 |  213.5   0.1  292.6  795.3  147.4</p>
<p>So that&#8217;s our base. Let&#8217;s try the same with xfs on a large file -</p>
<p>[root@udbqa006.sf2p ~]# mount<br />
/dev/mapper/VolGroup20-mysqldata on /data type xfs (rw,noatime,allocsize=1g)</p>
<p>[root@udbqa006.sf2p ~]# ls -l /data<br />
total 335544320<br />
-rw-r--r--  1 root  root  343597383680 Sep  9 16:05 foo</p>
<p>[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# xfs_bmap -l /data/foo<br />
/data/foo:<br />
        0: [0..173950911]: 320..173951231 173950912 blocks<br />
        1: [173950912..341722991]: 183500864..351272943 167772080 blocks<br />
        2: [341722992..509495071]: 367263808..535035887 167772080 blocks<br />
        3: [509495072..671088639]: 550502464..712096031 161593568 blocks</p>
<p>[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60<br />
  total |  read:         latency (ms)       |  write:        latency (ms)<br />
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev<br />
--------+-----------------------------------+----------------------------------<br />
 1887.2 | 1508.4   0.1   66.2 1694.0   77.1 |  378.8   0.1    0.1    2.9    0.1<br />
 1678.6 | 1341.6   0.1   74.4 1584.6   87.0 |  337.0   0.1    0.1    1.1    0.0<br />
 1670.9 | 1334.9   0.1   75.0 1567.0   88.0 |  336.0   0.1    0.1    1.0    0.0</p>
<p>[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata<br />
  Logical volume &#8220;snap&#8221; created</p>
<p> 1592.2 | 1271.3   0.1   74.7 1401.9   89.3 |  320.8   0.1   14.7 1801.1   95.8<br />
 1073.0 |  859.6   0.1   43.5  682.1   44.5 |  213.4   0.1  294.2  796.2  150.9<br />
 1090.7 |  872.6   0.1   42.9  593.9   44.0 |  218.1   0.1  287.3  717.1  146.0</p>
<p>Good &#8211; xfs matches raw &#8211; with or without snapshot. Now let&#8217;s look at ext3 with the same size file -</p>
<p>[root@udbqa006.sf2p /home/pv/work/src/randomio-1.3]# ./randomio /data/foo 100 0.2 0 4096 60<br />
  total |  read:         latency (ms)       |  write:        latency (ms)<br />
   iops |   iops   min    avg    max   sdev |   iops   min    avg    max   sdev<br />
--------+-----------------------------------+----------------------------------<br />
  149.2 |  119.3   4.8  668.1 1042.5   54.1 |   30.0   5.2  660.9 1048.3   64.6<br />
  164.8 |  131.8   1.8  608.0  740.7   43.6 |   32.9 512.3  603.9  734.5   36.5<br />
  181.4 |  145.6   8.0  552.3  687.0   41.5 |   35.8 419.2  547.1  671.7   36.6<br />
  197.6 |  157.7   6.8  507.8 1118.2   67.9 |   39.9   0.1  498.9 1078.5   67.3<br />
  224.3 |  180.3   4.5  448.2  647.6   45.3 |   43.9   0.1  439.5  623.6   47.1<br />
  256.9 |  204.2   5.1  390.9  521.1   42.9 |   52.7   0.2  383.4  505.7   43.1<br />
  295.4 |  236.6   3.0  340.5  474.3   44.4 |   58.9   0.1  330.5  461.9   46.4<br />
  350.3 |  279.4   5.0  287.5  419.8   43.2 |   70.9   0.1  277.6  401.6   43.8<br />
  421.5 |  336.8   1.6  239.7  414.8   44.0 |   84.7   0.1  228.2  394.2   45.6<br />
  534.2 |  426.8   3.1  190.0  348.7   41.2 |  107.4   0.1  176.1  310.1   40.5<br />
  687.7 |  550.7   2.1  148.4  349.6   39.1 |  137.0   0.1  133.4  282.8   38.9<br />
  941.7 |  752.9   1.9  111.2  366.5   43.3 |  188.7   0.1   86.5  243.8   41.3<br />
 1258.9 | 1005.9   1.6   87.0  603.9   54.9 |  253.0   0.1   49.3  388.4   45.7<br />
 1498.6 | 1200.7   0.1   77.3 1056.2   72.1 |  297.9   0.1   24.1  559.9   46.9<br />
 1645.6 | 1313.2   0.1   73.6 1323.9   83.3 |  332.4   0.1   10.0  941.0   39.7<br />
 1692.2 | 1354.6   0.1   73.0 1436.4   86.1 |  337.6   0.1    3.2  640.0   26.9<br />
 1721.1 | 1377.5   0.1   72.3 1921.1   85.1 |  343.6   0.1    1.2  568.4   18.9<br />
 1721.9 | 1379.7   0.1   72.4 1562.9   83.7 |  342.2   0.1    0.2   69.7    2.4<br />
 1705.9 | 1364.8   0.1   73.3 1559.8   85.1 |  341.1   0.1    0.1    0.7    0.0<br />
 1693.7 | 1357.2   0.1   73.6 1480.5   87.1 |  336.5   0.1    0.1    2.1    0.0<br />
 1696.2 | 1355.7   0.1   73.8 1520.0   86.4 |  340.4   0.1    0.1    1.5    0.0</p>
<p>[root@udbqa006.sf2p ~]# lvcreate --size 60g --snapshot --name snap /dev/VolGroup20/mysqldata<br />
  Logical volume &#8220;snap&#8221; created</p>
<p>  626.7 |  502.0   0.1  168.7 2581.4  192.8 |  124.7   0.1  121.5 2471.1  203.7<br />
  310.3 |  248.7   6.2  321.2  529.4   61.9 |   61.6   7.0  326.0  518.9   62.9<br />
  305.5 |  244.0   2.1  326.1  595.8   61.2 |   61.5  10.0  331.3  579.2   61.1<br />
  313.3 |  251.2   4.9  319.0  503.5   60.2 |   62.1  12.8  322.3  496.6   59.9<br />
  300.5 |  239.7   4.1  331.4 1207.4   81.3 |   60.7  10.0  335.7 1201.1   83.5<br />
  312.7 |  250.8   4.1  318.6  587.2   65.7 |   61.8  10.0  324.8  574.1   67.0<br />
  309.3 |  247.5   3.3  322.5  540.3   58.9 |   61.8   9.0  326.3  517.9   57.3</p>
<p>What&#8217;s going on? I&#8217;ve glanced over the code real quick and this is my speculation. There are 2 issues with ext3. The first one is that it is not an extent based filesystem. It is block based. What that means is that there are lots of indirect address blocks addressing your data blocks and they don&#8217;t have to be all cached if they are not accessed frequently enough. So a read (or a write) might require a disk read in order to map your logical offset to the actual physical offset in the file. That means extra io&#8217;s before you even get to read/write your actual data. xfs (and vxfs and zfs) are extent based filesystems and the block maps of large files can fit into an inode itself if the filesystem is not severely fragmented so mapping your offset is very fast (you can preallocate space in vxfs and in xfs &#8211; I use a mount option allocsize=1g).</p>
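<p>To put a rough number on the indirect address blocks, here is a back-of-the-envelope sketch in Python. It assumes the classic ext2/ext3 mapping scheme (12 direct pointers, then single, double and triple indirect trees), a 4 KiB filesystem block size and 4-byte block pointers; the block size was not stated for this test, and the function name is made up for illustration.</p>

```python
def ext3_mapping_blocks(file_size, block_size=4096, ptr_size=4):
    """Count indirect (mapping) blocks for a fully allocated file.

    Assumes the classic ext2/ext3 layout: 12 direct pointers in the inode,
    then one single-, one double- and one triple-indirect tree.
    """
    ptrs = block_size // ptr_size          # pointers per indirect block
    data = -(-file_size // block_size)     # data blocks, rounded up
    remaining = max(data - 12, 0)          # blocks not covered by direct pointers
    total = 0
    for depth in (1, 2, 3):                # single, double, triple indirect
        covered = min(remaining, ptrs ** depth)
        remaining -= covered
        nodes = covered                    # walk up from the data level
        for _ in range(depth):
            nodes = -(-nodes // ptrs)      # mapping blocks one level up
            total += nodes
    return total


size = 343597383680                        # size of /data/foo from ls -l
blocks = ext3_mapping_blocks(size)
print(blocks, "mapping blocks,", blocks * 4096 // 2**20, "MiB of metadata")
```

<p>Under those assumptions the test file needs about 82,000 mapping blocks (roughly 320 MiB of metadata) that would have to be cached before a random 4K access can skip the extra read, compared with the four extents that xfs_bmap reports for the same file on xfs.</p>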
<p>The second problem is the level of parallelism a filesystem allows when it comes to competing reads/writes. I believe that ext3 is taking an exclusive rwlock on each write and doesn&#8217;t release it through the whole io. On the other hand xfs (unless you are growing the file) takes and holds the lock in a shared mode if the file is opened in O_DIRECT mode. This can have big consequences.</p>
<p>Look at the slow start of ext3 &#8211; it took nearly 15 minutes to reach raw speeds. Why is that? Look at the write latencies &#8211; at the beginning they are big &#8211; which we would not expect with a write-back cache. I believe that they are so big because the indirect address blocks were not in the cache. So what should be a very fast write under normal circumstances (write-back cache) turns into a read and a write while holding the exclusive rwlock &#8211; blocking all the readers at the same time. Eventually (in 15 mins or so) all of the indirect address blocks are cached and ext3 returns to &#8220;raw&#8221; performance.</p>
<p>But once the snapshot is created it plummets again. The reason is the same &#8211; a write that turns into a COW read plus a write holds the exclusive rwlock for much longer than a write without a COW operation, so it chokes the readers once again. A bigger write ratio has an even more drastic effect on ext3 + snapshot performance.</p>
<p>&#8220;randomio&#8221; might not be a good representation of what MySQL is doing, so we are currently testing xfs with MySQL. But if my speculation is valid, I would expect a smaller degradation with xfs.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
