<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: High-Performance Click Analysis with MySQL</title>
	<atom:link href="http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/</link>
	<description>Everything about MySQL Performance</description>
	<lastBuildDate>Sat, 21 Nov 2009 05:23:57 -0800</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Victoria Eastwood</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-436437</link>
		<dc:creator>Victoria Eastwood</dc:creator>
		<pubDate>Tue, 06 Jan 2009 13:42:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-436437</guid>
		<description>Hi Baron,

Very nice blog entry and thanks for the favorable mention of Infobright. We would be more than happy to get you or one of your team started with Infobright. Since Infobright is column oriented and sports very high compression, you will likely find that some of your guidance will change. 

Cheers</description>
		<content:encoded><![CDATA[<p>Hi Baron,</p>
<p>Very nice blog entry and thanks for the favorable mention of Infobright. We would be more than happy to get you or one of your team started with Infobright. Since Infobright is column oriented and sports very high compression, you will likely find that some of your guidance will change. </p>
<p>Cheers</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Baron Schwartz</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-426274</link>
		<dc:creator>Baron Schwartz</dc:creator>
		<pubDate>Sun, 28 Dec 2008 00:04:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-426274</guid>
		<description>Golan, for really mission critical tasks I sometimes like to eliminate a database, but that&#039;s another post ;)</description>
		<content:encoded><![CDATA[<p>Golan, for really mission critical tasks I sometimes like to eliminate a database, but that&#8217;s another post <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Baron Schwartz</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-426273</link>
		<dc:creator>Baron Schwartz</dc:creator>
		<pubDate>Sun, 28 Dec 2008 00:03:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-426273</guid>
		<description>I should clarify what I said about pre-aggregating before putting data into the database.  Peter is right -- for general-purpose queries on tables without indexes, MyISAM will be disk-bound anyway and you can&#039;t easily do it faster than the database can.  But besides the reasons I already mentioned, two cases when you can beat the database come to mind:

1) When you&#039;re doing complex analysis that isn&#039;t easily or efficiently expressible in SQL.  &quot;Finding the most interesting rows and various other stuff related to them&quot; for example.

2) When you&#039;re only going to query the data one time as an intermediate step.  If you just want to do a once-off aggregation, then reading it from disk, putting it into a table (writing it to disk), reading it from the table (from disk, again), aggregating it (possibly involving temp tables on disk and/or filesorting on disk), and writing it to another table (back to disk) can be less efficient than just reading it from disk into your script, aggregating it in memory and writing it to the final table.</description>
		<content:encoded><![CDATA[<p>I should clarify what I said about pre-aggregating before putting data into the database.  Peter is right &#8212; for general-purpose queries on tables without indexes, MyISAM will be disk-bound anyway and you can&#8217;t easily do it faster than the database can.  But besides the reasons I already mentioned, two cases when you can beat the database come to mind:</p>
<p>1) When you&#8217;re doing complex analysis that isn&#8217;t easily or efficiently expressible in SQL.  &#8220;Finding the most interesting rows and various other stuff related to them&#8221; for example.</p>
<p>2) When you&#8217;re only going to query the data one time as an intermediate step.  If you just want to do a once-off aggregation, then reading it from disk, putting it into a table (writing it to disk), reading it from the table (from disk, again), aggregating it (possibly involving temp tables on disk and/or filesorting on disk), and writing it to another table (back to disk) can be less efficient than just reading it from disk into your script, aggregating it in memory and writing it to the final table.</p>
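<p>A minimal sketch of case 2, assuming a hypothetical tab-separated click log with a day and an ad ID on each line. The log layout and the names <code>aggregate_clicks</code> and <code>to_rows</code> are illustrative and not taken from the post. The script reads the raw data once, aggregates it in memory, and leaves only the small result to be written to the final table.</p>

```python
import csv
import io
from collections import Counter


def aggregate_clicks(lines):
    """Count clicks per (day, ad_id) in a single pass over raw log lines.

    Assumes each line is "YYYY-MM-DD<TAB>ad_id"; this layout is hypothetical.
    """
    counts = Counter()
    for day, ad_id in csv.reader(lines, delimiter="\t"):
        counts[(day, int(ad_id))] += 1
    return counts


def to_rows(counts):
    """Turn the counter into sorted (day, ad_id, clicks) rows for a bulk insert."""
    return [(day, ad_id, n) for (day, ad_id), n in sorted(counts.items())]


# Stand-in for a raw log file on disk.
raw_log = io.StringIO(
    "2008-12-22\t7\n"
    "2008-12-22\t7\n"
    "2008-12-22\t9\n"
    "2008-12-23\t7\n"
)
rows = to_rows(aggregate_clicks(raw_log))
print(rows)
```

<p>The resulting rows could then be loaded with a single multi-row INSERT, so the raw events never pass through a staging table.</p>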
]]></content:encoded>
	</item>
	<item>
		<title>By: alz</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-422498</link>
		<dc:creator>alz</dc:creator>
		<pubDate>Wed, 24 Dec 2008 16:27:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-422498</guid>
		<description>Thanks, very interesting. I would say it is a &quot;must read&quot; for anybody who starts doing that sort of thing.

We use MySQL-based reporting for our internal ad serving analysis and &quot;discovered&quot; most of your recommendations ourselves. Our solution has several layers of aggregation and clustering, but it finally ends in one or more MySQL servers used for end-user reporting. It works fine and scales well.

Some ideas from our experience:
- we do not use replication. Instead we maintain several parallel, virtually identical systems. If one fails, we can either copy from another or switch over.
- we aggregate some data directly on the runtime servers, then in intermediate MySQL nodes, and finally when building reporting-ready aggregates.
- aggregation runs incrementally, so the reporting delay for end users is below 1 hour
- Peter is right, aggregation in big (but not too big) chunks is much faster
- we use InnoDB for dimensions that fit in memory and MyISAM for aggregates. We are considering InnoDB for aggregates too. However, MyISAM lets us copy files between servers, which is sometimes very convenient.
- we store raw data just in case but we do not use it. When it gets too big, we just delete it. Raw data required for aggregate processing is truncated daily.
- we do not use a lot of de-normalization; we just try to avoid snowflakes. If we do need de-normalization, we do it at the dimension level but keep aggregate tables compact. After all, MySQL is not a column-oriented store, so big tables should have the smallest row length possible for faster access.

Interestingly, we tried Oracle for the end-user reporting server, but we are now switching back to MySQL. Oracle is too difficult to maintain.</description>
		<content:encoded><![CDATA[<p>Thanks, very interesting. I would say it is a &#8220;must read&#8221; for anybody who starts doing that sort of thing. </p>
<p>We use MySQL-based reporting for our internal ad serving analysis and &#8220;discovered&#8221; most of your recommendations ourselves. Our solution has several layers of aggregation and clustering, but it finally ends in one or more MySQL servers used for end-user reporting. It works fine and scales well.</p>
<p>Some ideas from our experience:<br />
- we do not use replication. Instead we maintain several parallel, virtually identical systems. If one fails, we can either copy from another or switch over.<br />
- we aggregate some data directly on the runtime servers, then in intermediate MySQL nodes, and finally when building reporting-ready aggregates.<br />
- aggregation runs incrementally, so the reporting delay for end users is below 1 hour<br />
- Peter is right, aggregation in big (but not too big) chunks is much faster<br />
- we use InnoDB for dimensions that fit in memory and MyISAM for aggregates. We are considering InnoDB for aggregates too. However, MyISAM lets us copy files between servers, which is sometimes very convenient.<br />
- we store raw data just in case but we do not use it. When it gets too big, we just delete it. Raw data required for aggregate processing is truncated daily.<br />
- we do not use a lot of de-normalization; we just try to avoid snowflakes. If we do need de-normalization, we do it at the dimension level but keep aggregate tables compact. After all, MySQL is not a column-oriented store, so big tables should have the smallest row length possible for faster access.</p>
<p>Interestingly, we tried Oracle for the end-user reporting server, but we are now switching back to MySQL. Oracle is too difficult to maintain.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: peter</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-421987</link>
		<dc:creator>peter</dc:creator>
		<pubDate>Wed, 24 Dec 2008 00:57:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-421987</guid>
		<description>A couple of comments/clarifications from me:

1) Aggregate.  There is really some conflict here - to get the best aggregation speed you need to aggregate a chunk at once.  Merging 100000 events at once is much faster than processing them one by one, but you need to be careful about how it affects replication or causes table locks if you happen to use MyISAM.

2) Real time vs delayed. I think for many applications semi-real-time reporting is valuable, and as you mentioned it is possible at the low end.  So it is very handy to use adaptive techniques which aggregate small batches in the normal case but fall back to larger batches (and so more efficient processing) if they can&#039;t keep up well enough.

3) Denormalization. Common advice, but many people take it to extremes.  For example, if you store top countries for the day you do not have to store strings - IDs are just fine because the lookup table is small and you do few lookups anyway. Denormalization needs to be done in a way that lets your queries avoid a lot of random lookups, but you need to balance that against keeping the data compact.

4) MyISAM vs InnoDB - if you have read-only data you can often get data clustering with MyISAM too by ALTER TABLE ... ORDER BY X.  It is not the same as with InnoDB (the data is never stored in a clustered index) but it can get your IO pretty sequential for your prevailing access type.  InnoDB is indeed good as a default, but the tradeoff between space and performance can be important for some applications.  Actually this is where I would watch the Maria storage engine closely. It may be handy even in its &quot;crash safe MyISAM&quot; mode.

5) About storing raw data.  Storing it in MyISAM tables without indexes (say one per day), or later in Archive tables, can be close in efficiency to storing it outside the database (MyISAM will surely be bound by sequential disk writes) - and this gives the convenience of running SQL queries when you need to, without the extra step of loading the data.  I also prefer to keep things in one table per day or something like it, so it is very easy to move things between boxes (a physical copy, just as files), and if you use full table scans MERGE tables are very efficient.  What is often a bad choice, however, is to have a single huge, highly indexed table which keeps all events as logs.</description>
		<content:encoded><![CDATA[<p>A couple of comments/clarifications from me:</p>
<p>1) Aggregate.  There is really some conflict here &#8211; to get the best aggregation speed you need to aggregate a chunk at once.  Merging 100000 events at once is much faster than processing them one by one, but you need to be careful about how it affects replication or causes table locks if you happen to use MyISAM.</p>
<p>2) Real time vs delayed. I think for many applications semi-real-time reporting is valuable, and as you mentioned it is possible at the low end.  So it is very handy to use adaptive techniques which aggregate small batches in the normal case but fall back to larger batches (and so more efficient processing) if they can&#8217;t keep up well enough.</p>
<p>3) Denormalization. Common advice, but many people take it to extremes.  For example, if you store top countries for the day you do not have to store strings &#8211; IDs are just fine because the lookup table is small and you do few lookups anyway. Denormalization needs to be done in a way that lets your queries avoid a lot of random lookups, but you need to balance that against keeping the data compact.</p>
<p>4) MyISAM vs InnoDB &#8211; if you have read-only data you can often get data clustering with MyISAM too by ALTER TABLE &#8230; ORDER BY X.  It is not the same as with InnoDB (the data is never stored in a clustered index) but it can get your IO pretty sequential for your prevailing access type.  InnoDB is indeed good as a default, but the tradeoff between space and performance can be important for some applications.  Actually this is where I would watch the Maria storage engine closely. It may be handy even in its &#8220;crash safe MyISAM&#8221; mode.</p>
<p>5) About storing raw data.  Storing it in MyISAM tables without indexes (say one per day), or later in Archive tables, can be close in efficiency to storing it outside the database (MyISAM will surely be bound by sequential disk writes) &#8211; and this gives the convenience of running SQL queries when you need to, without the extra step of loading the data.  I also prefer to keep things in one table per day or something like it, so it is very easy to move things between boxes (a physical copy, just as files), and if you use full table scans MERGE tables are very efficient.  What is often a bad choice, however, is to have a single huge, highly indexed table which keeps all events as logs.</p>
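<p>Point 2 can be illustrated with a small sketch of adaptive batch sizing. The function name <code>next_batch_size</code>, the doubling and halving rule, and the size bounds are all assumptions made for illustration; the comment itself does not prescribe a particular policy.</p>

```python
def next_batch_size(backlog, current, min_size=1_000, max_size=100_000):
    """Pick how many queued events to aggregate in the next pass.

    Doubles the batch when the backlog exceeds it (the aggregator is falling
    behind) and halves it when the backlog is under half a batch, staying
    within the given bounds. All thresholds are illustrative assumptions.
    """
    if backlog > current:
        return min(current * 2, max_size)
    if backlog < current // 2:
        return max(current // 2, min_size)
    return current


# Simulate a burst of traffic followed by a quiet period.
size = 1_000
history = []
for backlog in [500, 5_000, 5_000, 5_000, 300, 300]:
    size = next_batch_size(backlog, size)
    history.append(size)
print(history)
```

<p>Under this rule the batch grows only while the aggregator is behind, so latency stays low in the normal case and throughput takes priority during bursts.</p>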
]]></content:encoded>
	</item>
	<item>
		<title>By: http://rpbouman.blogspot.com/</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-421966</link>
		<dc:creator>http://rpbouman.blogspot.com/</dc:creator>
		<pubDate>Tue, 23 Dec 2008 22:52:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-421966</guid>
		<description>Hi! 

Baron, this is a great post! Extremely useful information written down very well. Kudos, I&#039;ll be returning to this post many times I&#039;m sure. 

Keep it up,

Roland.</description>
		<content:encoded><![CDATA[<p>Hi! </p>
<p>Baron, this is a great post! Extremely useful information written down very well. Kudos, I&#8217;ll be returning to this post many times I&#8217;m sure. </p>
<p>Keep it up,</p>
<p>Roland.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Golan Zakai</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-421578</link>
		<dc:creator>Golan Zakai</dc:creator>
		<pubDate>Tue, 23 Dec 2008 10:50:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-421578</guid>
		<description>I think it&#039;s a very nice description of a known problem, but the perfect solution for this, in my opinion, is data mining the raw data from MySQL slaves into an OLAP cube and then querying the cube from the application, leaving MySQL to perform the mission-critical tasks of the application while the reporting runs on a different set of OLAP servers.</description>
		<content:encoded><![CDATA[<p>I think it&#8217;s a very nice description of a known problem, but the perfect solution for this, in my opinion, is data mining the raw data from MySQL slaves into an OLAP cube and then querying the cube from the application, leaving MySQL to perform the mission-critical tasks of the application while the reporting runs on a different set of OLAP servers.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Baron Schwartz</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-421454</link>
		<dc:creator>Baron Schwartz</dc:creator>
		<pubDate>Tue, 23 Dec 2008 07:17:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-421454</guid>
		<description>Gil, thanks for that comment.  Indeed your solution had slipped my mind :)</description>
		<content:encoded><![CDATA[<p>Gil, thanks for that comment.  Indeed your solution had slipped my mind <img src='http://www.mysqlperformanceblog.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gil</title>
		<link>http://www.mysqlperformanceblog.com/2008/12/22/high-performance-click-analysis-with-mysql/comment-page-1/#comment-421397</link>
		<dc:creator>Gil</dc:creator>
		<pubDate>Tue, 23 Dec 2008 06:37:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.mysqlperformanceblog.com/?p=562#comment-421397</guid>
		<description>Baron, we discussed this very same topic recently. If you recall, I was having trouble storing traffic data while simultaneously doing lookups to determine whether a pageview was considered unique. My solution was to use multiple Amazon SimpleDB buckets to store atomic traffic data. By implementing a consistent hashing system I am able to perform super-fast, targeted queries. Finally, the data is periodically aggregated and imported into MySQL, which can then be queried by our application. We can easily scale by allocating more SimpleDB buckets. Just goes to show how it can be helpful to explore other options besides a relational database...</description>
		<content:encoded><![CDATA[<p>Baron, we discussed this very same topic recently. If you recall, I was having trouble storing traffic data while simultaneously doing lookups to determine whether a pageview was considered unique. My solution was to use multiple Amazon SimpleDB buckets to store atomic traffic data. By implementing a consistent hashing system I am able to perform super-fast, targeted queries. Finally, the data is periodically aggregated and imported into MySQL, which can then be queried by our application. We can easily scale by allocating more SimpleDB buckets. Just goes to show how it can be helpful to explore other options besides a relational database&#8230;</p>
]]></content:encoded>
	</item>
</channel>
</rss>
