<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Linux schedulers in tpcc like benchmark</title>
	<atom:link href="http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/</link>
	<description>Everything about MySQL Performance</description>
	<lastBuildDate>Sat, 21 Nov 2009 05:23:57 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Vadim</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-669744</link>
		<dc:creator>Vadim</dc:creator>
		<pubDate>Wed, 28 Oct 2009 01:15:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-669744</guid>
		<description>Fredrik,

Thank you for sharing results. They are in line with my experience.
With a RAID controller, &#039;cfq&#039; is not a suitable choice.</description>
		<content:encoded><![CDATA[<p>Fredrik,</p>
<p>Thank you for sharing results. They are in line with my experience.<br />
With a RAID controller, &#8216;cfq&#8217; is not a suitable choice.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fredrik Widlund</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-669469</link>
		<dc:creator>Fredrik Widlund</dc:creator>
		<pubDate>Tue, 27 Oct 2009 10:24:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-669469</guid>
		<description>Stumbled on this thread. I&#039;m benchmarking fs/schedulers on a small HW RAID-5 set of 4x400GB SATA. It&#039;s a lab-setup used for parallel sequential streams and not DB usage. Raid adapter is PERC6/E, 1MB raid element size, Cached, Adaptive Read-ahead. Kernel 2.6.31.5 (Arch Linux).

Very short recap with some numbers (40 simultaneous seq reads):
XFS+noop: 140MB/s
EXT3+noop: 133MB/s
EXT4+noop: 111MB/s (exploded once with corrupt data when changing scheduler)
EXT2+noop: 97MB/s
XFS+cfq: 30MB/s

Some comments on this setup:
- XFS beats EXT*, with EXT4 performing badly and also being unstable
- cfq vs noop makes a *lot* of difference. In this setup cfq seems close to broken. deadline is similar to noop.</description>
		<content:encoded><![CDATA[<p>Stumbled on this thread. I&#8217;m benchmarking fs/schedulers on a small HW RAID-5 set of 4&#215;400GB SATA. It&#8217;s a lab-setup used for parallel sequential streams and not DB usage. Raid adapter is PERC6/E, 1MB raid element size, Cached, Adaptive Read-ahead. Kernel 2.6.31.5 (Arch Linux).</p>
<p>Very short recap with some numbers (40 simultaneous seq reads):<br />
XFS+noop: 140MB/s<br />
EXT3+noop: 133MB/s<br />
EXT4+noop: 111MB/s (exploded once with corrupt data when changing scheduler)<br />
EXT2+noop: 97MB/s<br />
XFS+cfq: 30MB/s</p>
<p>Some comments on this setup:<br />
- XFS beats EXT*, with EXT4 performing badly and also being unstable<br />
- cfq vs noop makes a *lot* of difference. In this setup cfq seems close to broken. deadline is similar to noop.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: steven</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-559327</link>
		<dc:creator>steven</dc:creator>
		<pubDate>Mon, 11 May 2009 07:22:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-559327</guid>
		<description>benpi: Is this what you&#039;re referring to in your first message ?  It makes sense to me and this is what I think is happening in this case with CFQ :

http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID

(read the last part of the article).</description>
		<content:encoded><![CDATA[<p>benpi: Is this what you&#8217;re referring to in your first message ?  It makes sense to me and this is what I think is happening in this case with CFQ :</p>
<p><a href="http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID" rel="nofollow">http://www.fishpool.org/post/2008/03/31/Optimizing-Linux-I/O-on-hardware-RAID</a></p>
<p>(read the last part of the article).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dermoth</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-463727</link>
		<dc:creator>dermoth</dc:creator>
		<pubDate>Wed, 04 Feb 2009 17:03:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-463727</guid>
		<description>I used Deadline with MySQL for many years... Recently I wanted to test the CFQ Idle class for an IO-intensive job running every day and causing a slight lag in replication. The end result was even worse, causing more lag in MySQL by running the job faster.

With &lt;2.6.24 kernels I even experienced a bug in which both MySQL and the job were running ridiculously slow with barely any IO operations on the RAID.

Might be different for other subsystems, but for me deadline just rocks, even without BBU (6-drive Raid10 though)</description>
		<content:encoded><![CDATA[<p>I used Deadline with MySQL for many years&#8230; Recently I wanted to test the CFQ Idle class for an IO-intensive job running every day and causing a slight lag in replication. The end result was even worse, causing more lag in MySQL by running the job faster.</p>
<p>With &lt;2.6.24 kernels I even experienced a bug in which both MySQL and the job were running ridiculously slow with barely any IO operations on the RAID.</p>
<p>Might be different for other subsystems, but for me deadline just rocks, even without BBU (6-drive Raid10 though)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: benpi</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460459</link>
		<dc:creator>benpi</dc:creator>
		<pubDate>Sat, 31 Jan 2009 17:44:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460459</guid>
		<description>Peter,
yes, that&#039;s where I went wrong: I thought that because O_DIRECT prevents to-be-written data from going from userland through the kernel page cache before being sent to the controller, writes wouldn&#039;t be well scheduled (and consequently believed it would be a huge loss on BBU-less systems). So I was triply dumb: because the alternative to O_DIRECT is fsync()ing the data file after writing, which doesn&#039;t leave much room for merging requests either; because this (O_DIRECT) only applies to data moved from the logs to the data file (MySQL can probably order this well already); and because schedulers (at least cfq) are also able to arbitrate which requests (not only &quot;which pages&quot;), even direct ones, can go to a device at any moment (so even then the scheduler shouldn&#039;t be totally bypassed, if I&#039;m not mistaken again).

And sadly no, there are no per-request priorities yet as far as I know (not even per posix thread yet, but just per process, which indeed is way more practical to use for PostgreSQL than it is for MySQL right now, but that may deserve a small little smallish tiny fork, maybe ;). Same goes for ioprio_set(2).</description>
		<content:encoded><![CDATA[<p>Peter,<br />
yes, that&#8217;s where I went wrong: I thought that because O_DIRECT prevents to-be-written data from going from userland through the kernel page cache before being sent to the controller, writes wouldn&#8217;t be well scheduled (and consequently believed it would be a huge loss on BBU-less systems). So I was triply dumb: because the alternative to O_DIRECT is fsync()ing the data file after writing, which doesn&#8217;t leave much room for merging requests either; because this (O_DIRECT) only applies to data moved from the logs to the data file (MySQL can probably order this well already); and because schedulers (at least cfq) are also able to arbitrate which requests (not only &#8220;which pages&#8221;), even direct ones, can go to a device at any moment (so even then the scheduler shouldn&#8217;t be totally bypassed, if I&#8217;m not mistaken again).</p>
<p>And sadly no, there are no per-request priorities yet as far as I know (not even per posix thread yet, but just per process, which indeed is way more practical to use for PostgreSQL than it is for MySQL right now, but that may deserve a small little smallish tiny fork, maybe <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> . Same goes for ioprio_set(2).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460429</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Sat, 31 Jan 2009 15:50:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460429</guid>
		<description>benpi,

Did Linux finally get per-request priorities? I worked with a university some time ago to see how IO priorities can be useful, and they were quite helpful, allowing us to work with large queue sizes without starving log writes etc.

Now about disk IO scheduling and cache. They are really a bit different, though connected. The dirty pages to be flushed are picked by the kernel (background flush or forced by fsync) and then submitted together with read requests. If you have a lot of dirty pages there is indeed a better chance to optimize IO.</description>
		<content:encoded><![CDATA[<p>benpi,</p>
<p>Did Linux finally get per-request priorities? I worked with a university some time ago to see how IO priorities can be useful, and they were quite helpful, allowing us to work with large queue sizes without starving log writes etc.</p>
<p>Now about disk IO scheduling and cache. They are really a bit different, though connected. The dirty pages to be flushed are picked by the kernel (background flush or forced by fsync) and then submitted together with read requests. If you have a lot of dirty pages there is indeed a better chance to optimize IO.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460426</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Sat, 31 Jan 2009 15:44:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460426</guid>
		<description>Domas,

Yeah XFS rocks in terms of not serializing O_DIRECT writes and also does not have contention problems on meta data updates
http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/

The problem is most people run on EXT3 and it is hard to make them change :)</description>
		<content:encoded><![CDATA[<p>Domas,</p>
<p>Yeah XFS rocks in terms of not serializing O_DIRECT writes and also does not have contention problems on meta data updates<br />
<a href="http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/" rel="nofollow">http://www.mysqlperformanceblog.com/2009/01/21/beware-ext3-and-sync-binlog-do-not-play-well-together/</a></p>
<p>The problem is most people run on EXT3 and it is hard to make them change <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: benpi</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460375</link>
		<dc:creator>benpi</dc:creator>
		<pubDate>Sat, 31 Jan 2009 13:41:06 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460375</guid>
		<description>Vadim,
fair enough, RAID 10 + BBU is probably the most common and natural choice for write stressed servers (ps: I asked about RAM size because I still don&#039;t understand why hardware vendors don&#039;t load them with lots more RAM (even if this implies a bigger battery, that shouldn&#039;t cost much more), a possible answer would be that performance wouldn&#039;t improve much after a certain size is reached).

Domas,
does this differ from ext3&#039;s data=writeback side effects?

Peter,
my comment about O_DIRECT w/o BBU was based on the (very naive and unverified) belief that Linux I/O schedulers reorder based on the kernel&#039;s page cache content (so, it seems I was wrong), and on the fact that software-based reordering matters when one doesn&#039;t have hardware to handle it, MySQL using O_DIRECT to move data from sequentially written log files to the more write-costly, scattered/random innodb data file. Anyway, I stand corrected now, thank you :)

You are also very on point when you bring up latency in the discussion (and also queue sizes/depths); indeed scheduling is not only about aggregating write requests and bulk performance, but also about arbitrating resource usage among consumers and making decisions about average vs. worst-case latency vs. throughput, which sometimes does matter.

I suspect a mixed workload (alternating write-intensive phases, reads, and CPU-bound moments, not just huge sustained writes) would give the controller a better chance to reorder properly before its cache fills up. If so, the scheduler&#039;s added complexity may be even more of a waste, but in those conditions, the user&#039;s perceived performance may be better reflected by per-request time to completion (so, latency). I mean workloads including possibly concurrent reads and CPU work, idle moments, etc., as opposed to steady, bulk, write-only workloads (e.g. importing dumps), where one would rather measure the total wall-clock time needed to write the whole dataset without interruptions. Having a nicely interactive ajax interface stuck because some bulk job fills the BBU cache, the I/O queue and the innodb log file in the background may not be the expected behaviour, even if it gives a lower cumulative time to complete (and therefore more TPS).

On this matter: thanks to the recent work on cgroups and container resource control, Linux got a lot of refinement in how it exposes and enforces I/O priorities and arbitration. I&#039;m not sure whether MySQL leverages those features though (maybe it would be useful to lower priority for logs-&gt;data moves, or to increase priorities when big locks are held, or when deadlocks start to happen?), nor how well the different I/O elevators handle those features. And to the best of my knowledge, SQL lacks a way to specify expected QoS per request (I mean, something like the TCP/IP TOS field) which would allow exposing this to consumers (although MySQL offers INSERT DELAYED).

Back to the subject, I don&#039;t know anything about tpcc (and even less about how it&#039;s used here): does this bench really spread writes to emulate some randomness (wrt block device layout) or is it more of a &quot;plain sequential bulk insert/update on a very few tables&quot; thing?

Seekwatcher (http://oss.oracle.com/~mason/seekwatcher/) graphs would be interesting for such benchmarks, if anything to show what patterns you actually do measure (that&#039;s to insidiously revive the &quot;Tools&quot; post ;).</description>
		<content:encoded><![CDATA[<p>Vadim,<br />
fair enough, RAID 10 + BBU is probably the most common and natural choice for write stressed servers (ps: I asked about RAM size because I still don&#8217;t understand why hardware vendors don&#8217;t load them with lots more RAM (even if this implies a bigger battery, that shouldn&#8217;t cost much more), a possible answer would be that performance wouldn&#8217;t improve much after a certain size is reached).</p>
<p>Domas,<br />
does this differ from ext3&#8217;s data=writeback side effects?</p>
<p>Peter,<br />
my comment about O_DIRECT w/o BBU was based on the (very naive and unverified) belief that Linux I/O schedulers reorder based on the kernel&#8217;s page cache content (so, it seems I was wrong), and on the fact that software-based reordering matters when one doesn&#8217;t have hardware to handle it, MySQL using O_DIRECT to move data from sequentially written log files to the more write-costly, scattered/random innodb data file. Anyway, I stand corrected now, thank you <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>You are also very on point when you bring up latency in the discussion (and also queue sizes/depths); indeed scheduling is not only about aggregating write requests and bulk performance, but also about arbitrating resource usage among consumers and making decisions about average vs. worst-case latency vs. throughput, which sometimes does matter.</p>
<p>I suspect a mixed workload (alternating write-intensive phases, reads, and CPU-bound moments, not just huge sustained writes) would give the controller a better chance to reorder properly before its cache fills up. If so, the scheduler&#8217;s added complexity may be even more of a waste, but in those conditions, the user&#8217;s perceived performance may be better reflected by per-request time to completion (so, latency). I mean workloads including possibly concurrent reads and CPU work, idle moments, etc., as opposed to steady, bulk, write-only workloads (e.g. importing dumps), where one would rather measure the total wall-clock time needed to write the whole dataset without interruptions. Having a nicely interactive ajax interface stuck because some bulk job fills the BBU cache, the I/O queue and the innodb log file in the background may not be the expected behaviour, even if it gives a lower cumulative time to complete (and therefore more TPS).</p>
<p>On this matter: thanks to the recent work on cgroups and container resource control, Linux got a lot of refinement in how it exposes and enforces I/O priorities and arbitration. I&#8217;m not sure whether MySQL leverages those features though (maybe it would be useful to lower priority for logs-&gt;data moves, or to increase priorities when big locks are held, or when deadlocks start to happen?), nor how well the different I/O elevators handle those features. And to the best of my knowledge, SQL lacks a way to specify expected QoS per request (I mean, something like the TCP/IP TOS field) which would allow exposing this to consumers (although MySQL offers INSERT DELAYED).</p>
<p>Back to the subject, I don&#8217;t know anything about tpcc (and even less about how it&#8217;s used here): does this bench really spread writes to emulate some randomness (wrt block device layout) or is it more of a &#8220;plain sequential bulk insert/update on a very few tables&#8221; thing?</p>
<p>Seekwatcher (<a href="http://oss.oracle.com/~mason/seekwatcher/" rel="nofollow">http://oss.oracle.com/~mason/seekwatcher/</a>) graphs would be interesting for such benchmarks, if anything to show what patterns you actually do measure (that&#8217;s to insidiously revive the &#8220;Tools&#8221; post <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Domas Mituzas</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460311</link>
		<dc:creator>Domas Mituzas</dc:creator>
		<pubDate>Sat, 31 Jan 2009 10:20:16 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460311</guid>
		<description>IO serialization doesn&#039;t happen with XFS :) 

And the reason there&#039;s no difference between noop and deadline is simply that deadline is noop with starvation protection. If you do not starve too much, there&#039;s not much need for deadline.</description>
		<content:encoded><![CDATA[<p>IO serialization doesn&#8217;t happen with XFS <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  </p>
<p>And the reason there&#8217;s no difference between noop and deadline is simply that deadline is noop with starvation protection. If you do not starve too much, there&#8217;s not much need for deadline.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/01/30/linux-schedulers-in-tpcc-like-benchmark/comment-page-1/#comment-460273</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Sat, 31 Jan 2009 07:21:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=601#comment-460273</guid>
		<description>benpi,

An interesting note about O_DIRECT though... and how it can affect results. In this kernel (at least) O_DIRECT serializes writes at the inode level, which can affect things significantly.

I also did not quite get the point about &quot;without BBU and so without O_DIRECT&quot;: even if you do not have a BBU, O_DIRECT is often very helpful because it avoids double buffering. It may be slower, though, if the workload is write bound, because of IO serialization for O_DIRECT.</description>
		<content:encoded><![CDATA[<p>benpi,</p>
<p>An interesting note about O_DIRECT though&#8230; and how it can affect results. In this kernel (at least) O_DIRECT serializes writes at the inode level, which can affect things significantly.</p>
<p>I also did not quite get the point about &#8220;without BBU and so without O_DIRECT&#8221;: even if you do not have a BBU, O_DIRECT is often very helpful because it avoids double buffering. It may be slower, though, if the workload is write bound, because of IO serialization for O_DIRECT.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
