We have a lot of customers who do click analysis, site analytics, search engine marketing, online advertising, user behavior analysis, and many similar types of work. The first thing these have in common is that they generally revolve around some kind of loggable event.
The next characteristic of a lot of these systems (real or planned) is the desire for “real-time” analysis. Our customers often want their systems to provide the freshest data to their own clients, with no delays.
Finally, the analysis is usually multi-dimensional. The typical user wants to generate summaries and reports in many different ways on demand, often to support the functionality of the application as well as to provide reports to their clients: clicks by day, clicks by customer, top ads by clicks, top ads by click-through ratio, and so on, for dozens of different types of slicing and dicing.
As a result, one of the most common questions we hear is how to build high-performance systems for this kind of work. Let’s look at some ways to build the functionality you need while getting the performance you need. Because I’ve built two such systems to manage online ads through Google AdWords, Yahoo, MSN and others, search engine marketing is an easy and familiar example for me, and I’ll use it throughout this article.
The words “need” and “want” are different. Do you really need atomic-level data? Do you really need real-time reporting? If you do, the problem is much more expensive to solve.
Start with the granularity of your data. What data do you need to make your business run? If you can’t get access to the time of day of every click on every ad, will it hamper your ability to measure the ad’s value? Is it enough to know how many times the ad was clicked each day? If so, you can roll all those events up into a per-day table.
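To make the granularity question concrete, here’s a minimal sketch (in Python, with made-up event data and field names) of rolling atomic click events up into per-ad, per-day counts, discarding the time of day:

```python
from collections import Counter
from datetime import datetime

# Hypothetical atomic click events: (timestamp, ad_id).
events = [
    ("2009-02-17 09:15:00", 42),
    ("2009-02-17 13:40:00", 42),
    ("2009-02-18 08:05:00", 42),
    ("2009-02-17 10:00:00", 7),
]

# Roll up to (day, ad) -> click count; the time of day is thrown away.
rollup = Counter()
for ts, ad in events:
    day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date().isoformat()
    rollup[(day, ad)] += 1

print(rollup[("2009-02-17", 42)])  # 2
```

If the per-day counts answer your business questions, rows like these are all the database ever needs to store.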
Next, let’s look at “real-time.” None of the big three (Google, Yahoo, MSN) provided real-time reporting the last time I was involved with them, and I suspect this is still true. It’s too expensive. Consider your users’ expectations. For most applications I’ve been involved with, day-old data is adequate, and users don’t expect real-time. The trick here is that when you start out, real-time is possible because your data is small. “Hey, we do real-time reporting. Google doesn’t even do that! We’re better!” Then you get popular, and real-time reporting becomes infeasible. And if you’ve promoted your better-ness in the meantime, you might have to do some awkward backpedaling with customers, who now expect real-time data. The database giveth, and the database taketh away.
Finally, you should think a lot about how you need to query the data. It is a hard question to answer, and sometimes I’ve seen the answer evolve over time, especially as the growing data size forces it to. This goes back to what data you really need to make your business run. Anything else is gravy. If there are nice-to-haves, consider not building them in. Listen to some talks by 37signals if you need inspiration to toss things out. Define the types of queries you absolutely have to have, if possible, and note the ways and types of aggregation (by-ad by-day, for example).
Sometimes I ask a customer “what kinds of queries do you have to run?” and they say “we can’t decide, so we want to just store everything.” If you can’t decide yet, then don’t store everything in the database. Instead, store the source data in some fashion that you can reload later, such as flat files, and build support in the database for one or two capabilities you absolutely need now; then add the rest later, reloading the data if needed.
Aggregation is absolutely key for most people. There are special cases, and there are ways to do general-purpose work without aggregating (see the section below on technologies), but if you’re doing this with vanilla MySQL, you will need to aggregate your data.
What you want to do is aggregate in ways that optimize the most expensive things you’ll do. And then, you might super-aggregate too. For example, if you aggregate by day and then do a lot of queries over 365-day ranges for year-over-year analysis, aggregate again by month. Then write your queries to use the most aggregated data possible to save work.
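As a sketch of that super-aggregation step (with hypothetical table contents), rolling the per-day rows up again by month:

```python
from collections import Counter

# Hypothetical per-day aggregate rows: (day, ad) -> clicks.
daily = {
    ("2009-01-05", 42): 10,
    ("2009-01-20", 42): 15,
    ("2009-02-03", 42): 7,
}

# Super-aggregate to (month, ad) -> clicks.  A 365-day query can now
# read about 12 monthly rows per ad instead of about 365 daily ones.
monthly = Counter()
for (day, ad), clicks in daily.items():
    monthly[(day[:7], ad)] += clicks

print(monthly[("2009-01", 42)])  # 25
```

The same principle applies inside MySQL: maintain an `ads_by_month` table alongside `ads_by_day`, and point year-over-year reports at the monthly table.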
Avoid operations that update huge chunks of aggregated data at once. Among other things, you’ll make replication lag badly. More about this later.
Another way to say “aggregate” is to say “pre-compute.” If your app has time-critical queries it needs to do its work, can you do the work ahead of time so the result is ready to fetch when needed? This might or might not be aggregation.
Pre-computing and careful denormalization need to go together. Figure out what other types of data you’ll need in those aggregate tables, and include columns to support those queries. But beware of denormalizing with character data; try to keep your rows fixed-length.
One reason denormalization is important is that nested-loop joins on large data sets are very expensive. If MySQL supported sort-merge or hash joins, you’d have other possibilities, but it doesn’t, so you want to build your aggregate tables to avoid joins.
Watch Data Types
Does your ad ID look like “8a4dabde-1c82-102c-ab13-0019b984eacd” and is it stored in a VARCHAR(36)? When tables get big, every byte matters a lot. Use the smallest data types you can, the simplest character sets you can, and watch out for NULLable columns. Use smallint unsigned or tinyint unsigned if you can. You can save very large amounts of space. Choose primary keys very carefully, especially with InnoDB tables — don’t use GUIDs. Which brings me to my next point:
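Some back-of-the-envelope arithmetic shows why the primary key choice matters so much. The row count and index count below are assumed figures, not benchmarks, and the model is deliberately simplified, but it captures the key fact: in InnoDB, every secondary index entry also carries a copy of the primary key, so a fat key is paid for over and over:

```python
# Simplified key-size model (assumed figures, not benchmarks).
rows = 100_000_000
secondary_indexes = 2

guid_pk_bytes = 36   # VARCHAR(36) GUID, ignoring length-byte overhead
int_pk_bytes = 4     # INT UNSIGNED

def pk_overhead(key_bytes):
    # The PK is stored once in the clustered index,
    # plus once per secondary index entry, per row.
    return rows * key_bytes * (1 + secondary_indexes)

saved = pk_overhead(guid_pk_bytes) - pk_overhead(int_pk_bytes)
print(saved / 1024**3)  # roughly 8.9 GiB saved on key storage alone
```

And that’s before counting the cache-hit-rate and random-I/O effects of GUIDs, which insert in essentially random key order.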
Assuming that you will use the stock MySQL server, InnoDB is usually your best bet. (Actually, XtraDB might be very interesting for you, but I digress.) Due to the cost of repairing huge MyISAM tables and taking downtime, I would not use MyISAM for anything but read-only tables when things get big. And even if it’s read-only, there’s still another reason to use InnoDB/XtraDB tables…
Optimize For I/O
It is pretty much inevitable: if you do this kind of data processing in MySQL, you’re going to end up heavily I/O bound. Listen to any of the talks at past MySQL conferences from people who have built systems like yours, and there’s a fair chance they will talk about how hard they have to work on I/O capacity.
What does this have to do with InnoDB? Data clustering. InnoDB’s primary key defines the physical order rows are stored in. That lets you choose which rows are stored close to each other, which is very beneficial in many cases. Especially on huge tables, it lets you scan portions of a table instead of the whole table if you a) choose your aggregation to match the order of your common queries and b) choose your primary key correctly.
Let’s go back to the ad-by-day table. If you query date ranges most of the time, you should define the primary key as (day, ad). Don’t use an auto-increment primary key, and don’t put ad first. If you put ad first, then you’re going to scan the whole table to query for information about yesterday. If you put day first, then yesterday will all be stored physically together (within the page — the pages themselves may be widely separated, but that’s another matter).
Don’t Store Non-Aggregated Data
I’ve been talking a lot about aggregated data. What do you do with the non-aggregated data? My answer is usually simple: just don’t store it in the database. Instead, pre-aggregate. Suppose your data is coming from some Apache log or similar source. Write a script to rip through the file and parse it 10k lines at a time, aggregating as it goes. When each chunk is done, make it write out a CSV file and import that with LOAD DATA INFILE. Keep those big fat log files out of the database. The database is usually the most expensive and hardest-to-scale component in your system — don’t waste resources.
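Here’s a minimal sketch of that kind of pre-aggregation script. The log format, file names, and chunk size are all hypothetical; a real script would parse your actual Apache log format and then run LOAD DATA INFILE on each CSV it emits:

```python
import csv
from collections import Counter
from itertools import islice

CHUNK = 10_000  # lines to aggregate per chunk


def aggregate_log(log_path, csv_prefix):
    """Rip through the log CHUNK lines at a time, aggregating
    (day, ad) -> clicks, writing one CSV per chunk for LOAD DATA INFILE.
    Returns the number of chunk files written."""
    n = 0
    with open(log_path) as log:
        while True:
            chunk = list(islice(log, CHUNK))
            if not chunk:
                break
            counts = Counter()
            for line in chunk:
                # Hypothetical log format: "<day> <ad_id>"
                day, ad = line.split()
                counts[(day, int(ad))] += 1
            with open(f"{csv_prefix}.{n}.csv", "w", newline="") as out:
                writer = csv.writer(out)
                for (day, ad), clicks in sorted(counts.items()):
                    writer.writerow([day, ad, clicks])
            n += 1
    return n
```

Because the same (day, ad) pair can appear in more than one chunk, the import should add to existing rows rather than replace them — for example by loading into a staging table and applying it with INSERT … ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks).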
Another benefit of this is the chance to parallelize. As you know, MySQL doesn’t do intra-query parallelization, so ETL jobs written to rely on SQL tend to get really bogged down. In contrast, moving the processing outside the database lets you parallelize trivially.
If you need to analyze the non-aggregated data, you can store it on the filesystem and write custom scripts to do special-purpose tasks on it. Storing a little meta-data about each file can help a lot. Store the ranges of values for various attributes, for example, or the presence or absence of values. You can put these into the database in a little meta-table. Then your script can figure out which files it can ignore. What we’re doing here starts to look like a hillbilly version of Infobright, which I’ll talk about later.
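A sketch of that file-pruning idea, with hypothetical file names and a min/max day range per file standing in for the meta-table:

```python
# Hypothetical per-file metadata, as you might keep in a small meta-table:
# file name -> (min_day, max_day) of the events it contains.
file_meta = {
    "events.2009-01.log": ("2009-01-01", "2009-01-31"),
    "events.2009-02.log": ("2009-02-01", "2009-02-28"),
    "events.2009-03.log": ("2009-03-01", "2009-03-31"),
}


def files_for_range(lo, hi):
    """Return only the files whose day range overlaps [lo, hi];
    every other file can be skipped without being read at all."""
    return sorted(
        name for name, (mn, mx) in file_meta.items()
        if mn <= hi and mx >= lo
    )


print(files_for_range("2009-02-10", "2009-02-20"))  # ['events.2009-02.log']
```

The same trick extends to any attribute whose range (or presence/absence) you record per file: the script consults the meta-table first and opens only the files that could possibly match.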
Alternately, you can store the atomic data as CSV files and use the CSV engine so you have an SQL interface to it (the meta-tables are still a valid approach here!). This is an easy way to bypass the hard-to-scale database server for the initial insertion, because you can write CSV files with any programming language. Naturally, CSV files don’t store as compactly on disk as [Compressed] MyISAM or Archive.
These are just some ideas I’m throwing around — the point is to think outside the box, even to think of things that seem “less advanced” than using a database.
Sharding and Partitioning
Sharding is inevitable if your write workload exceeds the capacity of a single server (or, if you’re using replication, the capacity of a single slave). Sharding can also help you avoid massive tables that are too big to maintain. If you know you’ll get there eventually, it can be worth designing for sharding early in your application’s lifecycle.
What about partitioning in MySQL 5.1? There are some cases where it can help a lot, and we’ve proven that with our customers. But you still have to think about how to avoid enormous tables that are hard to maintain, back up, and restore. And the partitioning functionality is not done yet and not fully integrated into the server, so I expect to find a lot more bugs and annoyances. There are already inconvenient limitations on some key parts of partitioning, such as maintenance and repair commands, that essentially negate the benefits of partitioning for those operations. And finally, it doesn’t save you from the downtime caused by ALTER TABLE — a typical reason to consider master-master replication with failover and failback for maintenance. As with anything, it’s a cost-benefit equation. What are your priorities? Choose the solution that meets them.
Be Careful With Data Integrity
When you’re storing several levels of aggregation, and there’s denormalization, you need to be scrupulous about data cleanliness, because it’s really hard to fix things up later. If your data is coming from a partner site, and you upload bad data there, you’ll be getting bad data back for a long time. And every time you have some incremental job to update the aggregates, you’re exposed to that bad data again.
Any inconsistencies in the atomic data tend to get magnified as it’s aggregated, because a single row is suddenly created from many rows, and if the many rows don’t match completely, it’s ambiguous what data should live in the single one. And this only gets harder to resolve as you add more levels of aggregation.
Watch Out For The Long Tail
People talk about the long tail and how you can focus on optimizing the short head. It’s the classic 80-20 rule. Maybe 80% of your ad impressions are on 20% of your ads! Hooray! But don’t forget that if you’re aggregating per-day, an ad that gets a million impressions takes one row, and an ad that gets a single impression takes exactly the same: one row. Every ad that appears at all on a given day costs you a row, regardless of volume, so you actually have as many rows as you have unique ads per day. Viewed this way, you suddenly start to hate the ads that only occasionally get an impression. They’re so wasteful!
It’s easy to flip back and forth between viewpoints on this and get distracted into making a mistake. Watch out when you do your capacity planning. Don’t get fooled into calculating the wrong thing.
Be Creative With Table Structures
Suppose you have some yes/no fact about an ad impression, such as whether it was a blue ad (whatever that means). You start out with this:
create table ads_by_day_by_blueness (
  day date not null,
  ad int unsigned not null,
  is_blue tinyint unsigned not null,
  clicks int unsigned not null,
  impressions int unsigned not null,
  primary key(day, ad, is_blue)
);
What can we improve here? Especially assuming that there are indexes other than the primary key, we can shrink the primary key’s width:
create table ads_by_day_by_blueness (
  day date not null,
  ad int unsigned not null,
  clicks int unsigned not null,
  impressions int unsigned not null,
  blue_clicks int unsigned not null,
  blue_impressions int unsigned not null,
  primary key(day, ad)
);
There are a couple of ways to handle this now. You can have the clicks column record the total, and the blue_clicks column record only blue clicks; to find out non-blue clicks you subtract one from the other. Or you can have the blue clicks and non-blue clicks stored, and to get the totals you add them.
Did this gain us anything? We dropped one column, and we just moved the other values around to store them “next to, in the same row” instead of “below, in the next row.” So we’re storing all the same data, right?
Logically, yes; physically, no. The values we pivoted up beside their neighbors now share one set of primary key columns. And not only is every index entry a little narrower, the table now contains only half as many rows, which makes the indexes less than half the size. In real life this technique often makes the table-plus-indexes much less than half their former size. You have to write slightly more complex queries, but that’s often justified by a large reduction in data size.
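Using the first convention (clicks holds the total, blue_clicks holds only the blue subset; the row values here are hypothetical), the derived numbers are simple arithmetic:

```python
# One dog-eared row per (day, ad): clicks is the total,
# blue_clicks is only the blue subset.
row = {
    "day": "2009-02-17",
    "ad": 42,
    "clicks": 100,
    "blue_clicks": 30,
    "impressions": 5000,
    "blue_impressions": 1200,
}

non_blue_clicks = row["clicks"] - row["blue_clicks"]                  # 70
non_blue_impressions = row["impressions"] - row["blue_impressions"]  # 3800

# In SQL this is just:
#   SELECT clicks - blue_clicks AS non_blue_clicks, ...
print(non_blue_clicks, non_blue_impressions)
```

The other convention (store blue and non-blue separately, add them for totals) is the same idea with addition instead of subtraction; pick whichever matches your most common queries.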
I sort of stumbled upon this idea one day. I have no idea what this technique might be called, so I call it dog-earing the table (somehow the image of putting columns next to each other makes me think of putting cards next to each other and shoving).
If you don’t need data anymore, move it away or get rid of it. I wrote a three-part article on data archiving on my own blog a while back. The benefits of purging and archiving data can be dramatic.
Take It Easy On Replication
Building aggregated tables is hard work for the database server. If you do it on the master with INSERT..SELECT queries, it will propagate to the slaves and be hard work there too, assuming you use statement-based replication.
You can save that work either by using MySQL 5.1’s row-based replication or, in MySQL 5.0 and earlier, by doing the work on a slave and then piping the results back up to the master with LOAD DATA INFILE, which emulates row-based replication in a way.
When you’re updating big aggregate tables, don’t work with giant chunks of them at once. If there’s any possible way, do it in manageable bits. A day at a time, for example.
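A sketch of that chunking, generating one small statement per day instead of one huge one (the table and column names are hypothetical):

```python
from datetime import date, timedelta


def daily_update_statements(start, end):
    """Yield one aggregate-update statement per day in [start, end),
    so each transaction (and each replicated event) stays small."""
    day = start
    while day < end:
        yield (
            "INSERT INTO ads_by_day (day, ad, clicks) "
            f"SELECT day, ad, COUNT(*) FROM clicks_staging WHERE day = '{day}' "
            "GROUP BY day, ad "
            "ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks)"
        )
        day += timedelta(days=1)


stmts = list(daily_update_statements(date(2009, 2, 1), date(2009, 2, 4)))
print(len(stmts))  # 3 statements, one per day
```

Running these one at a time (optionally with a pause between them) keeps each transaction short, which is exactly what the slaves need to avoid falling behind.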
There are a lot of other ways you can make replication faster. I wrote a lot about this in our book, which is linked from the sidebar above.
Don’t Assume Traditional Methods Will Save You
What you’re really doing here is building a data warehouse, so you may think you should use traditional DW methods, such as star schemas. The problem is that MySQL doesn’t tend to perform well on data warehousing workloads. Its nested-loop joins are not all that fast on big joins, the query optimizer can sometimes pick bad plans when you have a lot of joins between fact and dimension tables, and so on. With careful tweaking, many of these problems can be overcome, but how much time do you have? And in some cases the gains are simply limited by MySQL’s weaknesses.
Not only that, but star schemas are not intended to be fast. The star schema is essentially “I admit defeat and accept table scans as a fact of life.” Table scans can be better than the alternative, if the alternatives are limited, but they’re still not what you need unless you’re okay with long queries that read a lot of rows — MySQL can’t handle too many of those at once.
Aside from star schemas, another tactic I see people try a lot is to build “flexible schemas” with tables that contain name-value pairs or something similar. The thought is that you can make the application believe it has a custom table, which is really constructed behind the scenes from the name-value tables in a complex query with many joins. I have never seen this approach scale well.
Use The Best Technologies You Can
MySQL is not the end-all and be-all. If you’re familiar with it and it can serve you reasonably well, it’s fine to use it for things that it’s not 100% optimal for. But if the costs of doing that are going to outweigh the costs of using another solution, then look at other solutions.
One that holds promise is Infobright. While I have not evaluated their technology in depth, I think it merits a good look. I had the chance at OpenSQL Camp to talk to Alex Esterkin and see him present on it, and based on that exposure, I think they are doing a lot of things right. When I know enough to have a real opinion (or when other Percona people get to it before I do!) you’ll see results on this blog.
Another is Kickfire — also something I have not had a chance to properly evaluate. And there are others, and there will continue to be more. Finally, PostgreSQL is clearly better out-of-the-box for some workloads than MySQL is, especially for more complex queries. Percona is not tied to MySQL, although we’re most famous for our knowledge about it. When another tool is the right one, we use it.
Have you thought about using something besides a database? You have your choice of buzzwords these days. Hadoop is a big one. But beware of falling into the trap of brute-forcing a solution that really needs to be solved with intelligent engineering instead of massive resources.
This article has been an overview of some of the tactics I’ve used to successfully scale large click-processing and other types of event-analysis databases. In some cases I’ve been able to avoid sharding for a long time and run on many fewer disk drives with much less memory, or even with 10-15x fewer servers. Clever application design, and a holistic approach, are absolutely necessary. You can’t look to the database to solve everything — you have to give it all the help you can. Hopefully it’s useful to you, too!