<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Why you don&#8217;t want to shard.</title>
	<atom:link href="http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/</link>
	<description>Everything about MySQL Performance</description>
	<lastBuildDate>Thu, 29 Jul 2010 19:06:57 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=8343</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Morgan Tocker</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-682806</link>
		<dc:creator>Morgan Tocker</dc:creator>
		<pubDate>Mon, 23 Nov 2009 17:28:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-682806</guid>
		<description>@ Anthony - I wrote under bullet point 2, that sharding was often a response to being write heavy (&quot;too many writes&quot;).  I didn&#039;t forget about replication, this article just has a specific purpose ;)

Most applications can be broken down into shards (see my comment #6 for examples), but I don&#039;t dispute this can be difficult in others.  The example I often give for an application that won&#039;t shard is IMDB&#039;s database.  I don&#039;t think there are many good ways to divide up actors and the movies they star in.

A small correction to your point about locking: Readers don&#039;t block writers in InnoDB because of MVCC, but MySQL does have locking.  Related to your point though is cross-box consistency, and it is an issue.  Peter wrote about this in comment #23.</description>
		<content:encoded><![CDATA[<p>@ Anthony &#8211; I wrote under bullet point 2, that sharding was often a response to being write heavy (&#8220;too many writes&#8221;).  I didn&#8217;t forget about replication, this article just has a specific purpose <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
<p>Most applications can be broken down into shards (see my comment #6 for examples), but I don&#8217;t dispute this can be difficult in others.  The example I often give for an application that won&#8217;t shard is IMDB&#8217;s database.  I don&#8217;t think there are many good ways to divide up actors and the movies they star in.</p>
<p>A small correction to your point about locking: Readers don&#8217;t block writers in InnoDB because of MVCC, but MySQL does have locking.  Related to your point though is cross-box consistency, and it is an issue.  Peter wrote about this in comment #23.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Peter van Dijk</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-682446</link>
		<dc:creator>Peter van Dijk</dc:creator>
		<pubDate>Mon, 23 Nov 2009 04:10:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-682446</guid>
		<description>@Anthony,
I think it might be helpful to consider that sharding can be used as another level of abstraction in a complex system. Specifically (and obviously this is a fairly gross oversimplification, but probably still valid):

Where a raw disk has a filesystem placed on top of it to aid in organisation of the underlying data,
a database server typically will use table structures on top of a filesystem to further abstract the low level operations of storing information in files into something that can be searched, modified and more easily maintained in a structured form.

Similarly, shards, when implemented in a useful way, are able to abstract a given system in such a way that you&#039;re able to distribute storage across an arbitrary number of machines. In our case, we have shards in different physical locations, where things like replication are completely impractical.

By extension, the reason that sharding isn&#039;t really a good idea for most people is the same reason that, for example, if you want to copy your holiday photos onto a USB thumb drive, you don&#039;t use a database to do it. In many cases, that extra level of abstraction is completely useless and simply adds complexity.

There are a lot of people who have spent a lot of time researching this area, and, particularly in the web world, it is an invaluable tool for dealing with enormous data sets. I think the notion that it&#039;s a &#039;hack that ignores 50 years of database theory&#039; probably just indicates the need for better education and understanding of how it can be used as a tool.</description>
		<content:encoded><![CDATA[<p>@Anthony,<br />
I think it might be helpful to consider that sharding can be used as another level of abstraction in a complex system. Specifically (and obviously this is a fairly gross oversimplification, but probably still valid):</p>
<p>Where a raw disk has a filesystem placed on top of it to aid in organisation of the underlying data,<br />
a database server typically will use table structures on top of a filesystem to further abstract the low level operations of storing information in files into something that can be searched, modified and more easily maintained in a structured form.</p>
<p>Similarly, shards, when implemented in a useful way, are able to abstract a given system in such a way that you&#8217;re able to distribute storage across an arbitrary number of machines. In our case, we have shards in different physical locations, where things like replication are completely impractical.</p>
<p>By extension, the reason that sharding isn&#8217;t really a good idea for most people is the same reason that, for example, if you want to copy your holiday photos onto a USB thumb drive, you don&#8217;t use a database to do it. In many cases, that extra level of abstraction is completely useless and simply adds complexity.</p>
<p>There are a lot of people who have spent a lot of time researching this area, and, particularly in the web world, it is an invaluable tool for dealing with enormous data sets. I think the notion that it&#8217;s a &#8216;hack that ignores 50 years of database theory&#8217; probably just indicates the need for better education and understanding of how it can be used as a tool.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anthony Berglas</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-682354</link>
		<dc:creator>Anthony Berglas</dc:creator>
		<pubDate>Mon, 23 Nov 2009 00:06:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-682354</guid>
		<description>MASTER/SLAVE REPLICATION

You forgot to mention that if there are many more reads than writes (the common case), then running read-only slave databases off a master provides scalability without having to resort to sharding.

Also, sharding only works if the shards are largely independent, e.g. GMail user accounts.  But sharding an integrated system such as an ERP is likely to slow it down, as the shards need to communicate.

Some databases (Oracle) can horizontally partition a table (and I hope thus a database) automatically based on key values.  That is the right approach.  Keep the logical/physical separation.  Google-style sharding and Bigtable are a hack that ignores 50 years of database theory.

You also forgot to mention that if you take a couple of big tables out of a database, you lose locking and transactions.  Not a good option. (Oh, I forgot, MySQL does not have locking anyway ;).)</description>
		<content:encoded><![CDATA[<p>MASTER/SLAVE REPLICATION</p>
<p>You forgot to mention that if there are many more reads than writes (the common case), then running read-only slave databases off a master provides scalability without having to resort to sharding.</p>
<p>Also, sharding only works if the shards are largely independent, e.g. GMail user accounts.  But sharding an integrated system such as an ERP is likely to slow it down, as the shards need to communicate.</p>
<p>Some databases (Oracle) can horizontally partition a table (and I hope thus a database) automatically based on key values.  That is the right approach.  Keep the logical/physical separation.  Google-style sharding and Bigtable are a hack that ignores 50 years of database theory.</p>
<p>You also forgot to mention that if you take a couple of big tables out of a database, you lose locking and transactions.  Not a good option. (Oh, I forgot, MySQL does not have locking anyway <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dathan Vance Pattishall</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-650159</link>
		<dc:creator>Dathan Vance Pattishall</dc:creator>
		<pubDate>Wed, 09 Sep 2009 00:03:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-650159</guid>
		<description>I do agree you should only use it if you need to run real-time, user-facing queries across a very large dataset (tens of TBs).

Sharding is super easy if you know what you&#039;re doing. Points 1 and 2 are not an issue at all for me. I can isolate all traffic for super power users to an in-memory DB, which will not be overrun if done correctly.</description>
		<content:encoded><![CDATA[<p>I do agree you should only use it if you need to run real-time, user-facing queries across a very large dataset (tens of TBs).</p>
<p>Sharding is super easy if you know what you&#8217;re doing. Points 1 and 2 are not an issue at all for me. I can isolate all traffic for super power users to an in-memory DB, which will not be overrun if done correctly.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Log Buffer</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-631747</link>
		<dc:creator>Log Buffer</dc:creator>
		<pubDate>Mon, 17 Aug 2009 16:12:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-631747</guid>
		<description>&quot;On the MySQL Performance Blog, Morgan Tocker explains why you don’t want to shard. (It has nothing to do with The Dark Crystal, I already checked.) [...]&quot;

&lt;a href=&quot;http://www.pythian.com/news/3561/log-buffer-158-a-carnival-of-the-vanities-for-dbas&quot; rel=&quot;nofollow&quot;&gt;Log Buffer #158&lt;/a&gt;</description>
		<content:encoded><![CDATA[<p>&#8220;On the MySQL Performance Blog, Morgan Tocker explains why you don’t want to shard. (It has nothing to do with The Dark Crystal, I already checked.) [...]&#8221;</p>
<p><a href="http://www.pythian.com/news/3561/log-buffer-158-a-carnival-of-the-vanities-for-dbas" rel="nofollow">Log Buffer #158</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-630023</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Fri, 14 Aug 2009 18:47:31 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-630023</guid>
		<description>Peter,

We&#039;re not against sharding. In fact, we help a lot of people shard properly. The problem is that it is now such a buzzword that people with a 1GB data set start sharding even if it is never going to grow over 10GB.

Bad design is one issue; the other, however, is simply working with sharded data.  It really depends a lot on how tightly coupled your data is - for example, hosting millions of separate blogs is very easy to shard because there are no interdependencies.

Large data also indeed causes operational concerns - databases in the TB range are often not fun in MySQL due to challenges with backups and especially things such as ALTER TABLE.  http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/

With backups, the concern is cross-box consistency.  With a single box you can restore a backup from yesterday and it will be consistent (even though not up to date) - in a sharded environment, backups will correspond to different points in time and so would not be consistent.</description>
		<content:encoded><![CDATA[<p>Peter,</p>
<p>We&#8217;re not against sharding. In fact, we help a lot of people shard properly. The problem is that it is now such a buzzword that people with a 1GB data set start sharding even if it is never going to grow over 10GB.</p>
<p>Bad design is one issue; the other, however, is simply working with sharded data.  It really depends a lot on how tightly coupled your data is &#8211; for example, hosting millions of separate blogs is very easy to shard because there are no interdependencies.</p>
<p>Large data also indeed causes operational concerns &#8211; databases in the TB range are often not fun in MySQL due to challenges with backups and especially things such as ALTER TABLE.  <a href="http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/" rel="nofollow">http://www.mysqlperformanceblog.com/2006/10/08/small-things-are-better/</a></p>
<p>With backups, the concern is cross-box consistency.  With a single box you can restore a backup from yesterday and it will be consistent (even though not up to date) &#8211; in a sharded environment, backups will correspond to different points in time and so would not be consistent.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-630016</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Fri, 14 Aug 2009 18:33:48 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-630016</guid>
		<description>Brooks Johnson,

Functional partitioning makes sense under two conditions:

1) The functional partitions are independent enough that you rarely, if ever, need to join data between them.  Putting different tables on different hosts is not the idea; putting &quot;Forum&quot; on one database host, &quot;Wiki&quot; on another and &quot;Bug System&quot; on a third is.

2) The gain you&#039;re looking for is relatively small.  It is often easy to find 3 independent functions, with one of them responsible for 50% of the load (so splitting it off gives you double capacity), but getting 10x this way is rarely possible.</description>
		<content:encoded><![CDATA[<p>Brooks Johnson,</p>
<p>Functional partitioning makes sense under two conditions:</p>
<p>1) The functional partitions are independent enough that you rarely, if ever, need to join data between them.  Putting different tables on different hosts is not the idea; putting &#8220;Forum&#8221; on one database host, &#8220;Wiki&#8221; on another and &#8220;Bug System&#8221; on a third is.</p>
<p>2) The gain you&#8217;re looking for is relatively small.  It is often easy to find 3 independent functions, with one of them responsible for 50% of the load (so splitting it off gives you double capacity), but getting 10x this way is rarely possible.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Dave</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-629104</link>
		<dc:creator>Dave</dc:creator>
		<pubDate>Thu, 13 Aug 2009 07:18:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-629104</guid>
		<description>@Baron and @Morgan,
Thanks again for the info. We actually archive our data yearly, and only store about 4 rolling years of data - unfortunately, most (if not all) of it is in regular use...

I have got some interesting numbers back from my analysis though:

My main area of concern was the replacement of data with refreshed information. This involved DELETEing large chunks and INSERTing new data. There is a single field that is primarily used when doing the DELETE, which contains roughly 200 unique entries with a fairly even distribution (e.g. DELETE FROM `table_name` WHERE `field_name` = 5).

After partitioning the data into 10 chunks based on the HASH of this field, we have experienced a four-fold improvement in the DELETE command, which is now able to DELETE 8.2m records from a 1.5b record table faster than a similar DELETE of 3.6m records from a 500m record table.

Maybe our application just happens to be one of those lucky ones that partitioning _may_ help!

I have yet to get some proper performance numbers on the SELECT side of the data, but I shouldn&#039;t imagine it will be any slower than the current layout, and the data replace section is such a key one that I&#039;m prepared to accept a small performance hit in exchange for the improvements I&#039;ve already mentioned.</description>
		<content:encoded><![CDATA[<p>@Baron and @Morgan,<br />
Thanks again for the info. We actually archive our data yearly, and only store about 4 rolling years of data &#8211; unfortunately, most (if not all) of it is in regular use&#8230;</p>
<p>I have got some interesting numbers back from my analysis though:</p>
<p>My main area of concern was the replacement of data with refreshed information. This involved DELETEing large chunks and INSERTing new data. There is a single field that is primarily used when doing the DELETE, which contains roughly 200 unique entries with a fairly even distribution (e.g. DELETE FROM `table_name` WHERE `field_name` = 5).</p>
<p>After partitioning the data into 10 chunks based on the HASH of this field, we have experienced a four-fold improvement in the DELETE command, which is now able to DELETE 8.2m records from a 1.5b record table faster than a similar DELETE of 3.6m records from a 500m record table.</p>
<p>Maybe our application just happens to be one of those lucky ones that partitioning _may_ help!</p>
<p>I have yet to get some proper performance numbers on the SELECT side of the data, but I shouldn&#8217;t imagine it will be any slower than the current layout, and the data replace section is such a key one that I&#8217;m prepared to accept a small performance hit in exchange for the improvements I&#8217;ve already mentioned.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Morgan Tocker</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-628653</link>
		<dc:creator>Morgan Tocker</dc:creator>
		<pubDate>Wed, 12 Aug 2009 14:48:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-628653</guid>
		<description>@Baron - thanks for being a little less politically correct than me ;)  I re-read my comment and realized I said &quot;it may help&quot;, but I really meant &quot;It may help, but I don&#039;t think it will fix it&quot;.

@Dave - When Baron was talking about archiving, he was probably referring to mk-archiver - http://www.maatkit.org/doc/mk-archiver.html.  It&#039;s an excellent tool if you find you don&#039;t need older records.</description>
		<content:encoded><![CDATA[<p>@Baron &#8211; thanks for being a little less politically correct than me <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' />   I re-read my comment and realized I said &#8220;it may help&#8221;, but I really meant &#8220;It may help, but I don&#8217;t think it will fix it&#8221;.</p>
<p>@Dave &#8211; When Baron was talking about archiving, he was probably referring to mk-archiver &#8211; <a href="http://www.maatkit.org/doc/mk-archiver.html" rel="nofollow">http://www.maatkit.org/doc/mk-archiver.html</a>.  It&#8217;s an excellent tool if you find you don&#8217;t need older records.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Baron Schwartz</title>
		<link>http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/comment-page-1/#comment-628588</link>
		<dc:creator>Baron Schwartz</dc:creator>
		<pubDate>Wed, 12 Aug 2009 12:45:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=911#comment-628588</guid>
		<description>Dave, I honestly doubt that partitioning is the miracle solution for you.  It is no silver bullet.

I think if the powers that be understood the real cost of MyISAM, that equation would flip on its head.

Maybe you can think about archiving.</description>
		<content:encoded><![CDATA[<p>Dave, I honestly doubt that partitioning is the miracle solution for you.  It is no silver bullet.</p>
<p>I think if the powers that be understood the real cost of MyISAM, that equation would flip on its head.</p>
<p>Maybe you can think about archiving.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
