There are interesting posts these days about the future of MySQL Replication by Brian, Frank and Arjen.

I think it is very interesting to take a look at the bigger picture using the framework from The Innovator’s Dilemma and The Innovator’s Solution.

I’m not going to speak about the disruption and commoditisation of the database market – I’ll leave that topic for the market analysts; my interest is the market of web applications in general.

Clearly the web is not the enterprise and has a lot of different properties – top web sites like Google, Yahoo and Facebook have to provide online service to tens and hundreds of millions of users, with quickly changing applications which have to be deployed on a relatively tight budget. Especially if you think about it as $/visitor or $revenue/server, the difference between most enterprises and most web applications is insane.
Not only that, but many of these systems were started by beginners, so traditional databases were also way too complex to use.

All these requirements made traditional databases irrelevant for many web properties – too complex and too expensive to start with.

In the second half of the 90s, when MySQL appeared on the market, many web applications were simple and often did not use a database at all. MySQL did not have replication or transactions at that time, but it was easy to understand, fast and free, and this is what a lot of people were looking for in those days.

Over the last 10 years web applications have changed, and so have their demands on data storage – now web applications have to handle much higher load and much larger data sets. They are also much more complicated, and the performance of MySQL is often a problem even for medium scale sites.

In response to these pressures, a few things happened. MySQL kept being “simple” and introduced relatively simple performance boosters – the Query Cache and Replication – which were clearly not enough to solve all performance problems (like scaling writes with replication or dealing with large data sets). The web crowd valued simplicity and really had little choice, so most of them, instead of “upgrading” to Oracle or other systems which offer more performance features out of the box, started deploying their own custom solutions – sharding and caching. Indeed, memcached is the best known caching technology today, to a large extent because Brad open sourced it rather than keeping it in house. There were a lot of custom technologies created for caching or scalability issues MySQL could not solve well – in-house and open source solutions for tasks like large scale file storage, queuing, data processing and so on.

But seriously, if you look at this – people do not really enjoy adding memcached: cache handling, with all its invalidation and consistency issues, is not fun. Nor do they like sharding, with its requirement to split the data in a predefined way, so that cross-shard queries become a pain. Dealing with lag in MySQL Replication is another issue which complicates application development. None of it is fun – developers had to do all this because there is no product out there which would let them build their applications in an easy way without dealing with all those issues. And I’m not speaking about just the database here, but rather the whole stack needed to build scalable web applications easily.
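To make the pain concrete, here is roughly what that hand-rolled cache handling looks like – a minimal Python sketch, assuming the python-memcached and MySQLdb client libraries and a hypothetical users table. Note how every single write path has to remember to invalidate:

import MySQLdb          # any MySQL driver would look much the same
import memcache         # python-memcached client

mc = memcache.Client(['127.0.0.1:11211'])
db = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='app')

def get_user(user_id):
    # check the cache first
    key = 'user:%d' % user_id
    user = mc.get(key)
    if user is None:
        # on a miss, fall back to MySQL and repopulate the cache
        cur = db.cursor()
        cur.execute('SELECT id, name, email FROM users WHERE id = %s', (user_id,))
        user = cur.fetchone()
        mc.set(key, user, time=300)   # 5 minute TTL as a safety net
    return user

def update_user_email(user_id, email):
    cur = db.cursor()
    cur.execute('UPDATE users SET email = %s WHERE id = %s', (email, user_id))
    db.commit()
    # the painful part: forget this delete anywhere and you serve stale data
    mc.delete('user:%d' % user_id)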

Customers are constantly asking me if there is something which would help them scale MySQL and get some HA out of the box, even at the medium level. Seriously – MySQL Cluster, Continuent, Master-Master Replication, DRBD and SAN based HA architectures all have their limits, which keeps any of them from being used for a very wide class of applications.

Another interesting trend is that the web is getting more enterprise-like. After a few years of geeky startup growth, many companies get sold or otherwise become structured in a more enterprise fashion and start thinking in more enterprise terms – they may want more packaged solutions rather than custom architectures, and they may not enjoy running too many servers. For others it becomes a space and power constraint.

It is also worth mentioning the web vs open source component here. I think there is an interesting split – some companies I talk to are committed to a “no vendor lock-in” policy and would not like proprietary solutions; others – typically the ones which already have some of these systems in house – do not really care and would go with a commercial solution if it solves their pains.

So what do we see as a response to these requirements?

From the MySQL side we have further development of MySQL Cluster to make it more usable for web apps, as well as development of MySQL Proxy to help with sharding or with using not fully up to date slaves.

Some innovation is coming from third party vendors – InfoBright and NitroDB presented storage engines targeting certain workloads, and PrimeBase is working on scalable blob streaming to make it possible to store large blobs, such as images, in the database efficiently.

The other wave is appliances – you can see the Violin Memory appliance, which can be used with MySQL to get very fast IO and so consolidate systems suffering from IO bound workloads. There is the KickFire appliance around the corner, which focuses more on CPU bound complex queries, and there are more in development.

Though I think the most interesting development, when it comes to web apps, comes from another side and is about non-SQL, non-relational data processing and storage systems – Google BigTable with MapReduce, Amazon Dynamo and SimpleDB, Hadoop, CouchDB – which come coupled with cloud computing and with handling geographically distributed, very large scale systems.

This is the area where I would expect the next big innovation to happen when it comes to web applications. Web applications operate with concepts which are not handled very efficiently by SQL and relational operations (think social graphs or permissions).

I would expect MySQL to continue to drill into the enterprise market over the next few years, with web applications starting to rely more and more on alternative systems for data storage and management (well, Google does it already).

I also think the piece which is missing now is not the database but rather the concept and platform – developers do not want to care about the database and caching; they just want their application to be developed quickly and to scale well. So what we need is some kind of cross-breed between Ruby on Rails, in terms of getting up and running fast, and scalability on the scale of BigTable with MapReduce.

Indeed, I think MySQL Replication usage will decline, but I would not expect memcached to be the leading factor pushing it out.

Comments
mike

This is an -awesome- summary of what is going on. Bringing out all the names we know and love, and the issues that still aren’t easily resolved.

I just saw this pop up in my feeds – you speak about MySQL Proxy helping out; well, this could perhaps make sharding transparent 🙂

http://pero.blogs.aprilmayjune.org/2008/04/10/hscale-01-released-partitioning-using-mysql-proxy/

Push sharding from the application layer to a transparent DB proxy… hell yeah!

mike

Oh, of course. I wasn’t saying it was foolproof. It is a neat idea though.

It seems like the most successful systems (to me) are ones that are shared-nothing and can scale horizontally just by adding a node (Google’s neverending architecture, Amazon’s crazy architecture, LiveJournal’s memcache/mogilefs, etc.)

As the Amazon Dynamo document says (I’ll quote it wrong) “the architecture should be designed with the knowledge that systems will go up and down at any given time, and should handle the issues transparently”

Memcache allows that. MogileFS allows that. CouchDB allows that. etc. Distributed architectures seem to be key. Replication was nice, but it just won’t scale infinitely. Not to say everything will, but it seems like being able to add a layer of abstraction in front of the storage (which I suppose MySQL Cluster may have some of this built in) might be part of the key. Of course, the middle-tier has to be able to join, sort, limit, etc. data sets from multiple physical locations… that will grow into a monster of its own.

Ideally, I want a bunch of components I can install ad-hoc on as many nodes as I want (in any combination of components on each physical system) and have it just run as soon as I inform the apps it is available (bonus if it doesn’t require that…) – we’re slowly evolving there, some technologies are already close enough it seems, but the RDBMS options still aren’t (except maybe some $$$ solutions, but we’re talking open source here!)

Jonathan

“So what we need is some kind of breed between Ruby on Rails in terms of getting it up and running fast and Scalability on the scale of Big Table with Map Reduce”

So basically, if the new Google App Engine were to use Ruby on Rails instead of Python, then MySQL is done for?

Well.. it will be interesting to see what will happen in the not-so-far future.

Brian Aker

Hi Peter!

I would agree that memcached is a roll-your-own solution. We have started the work of adding callback support for read-through caches to solve this problem in libmemcached (aka reading entirely through memcached). We will be adding the write-through cache pieces over the next month.

Once that is done… we just use memcached and we can ignore what the storage piece behind it is.
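(To illustrate the idea Brian describes – this is a conceptual Python sketch of a read-through cache, not libmemcached’s actual callback API: the cache layer, rather than the application, knows how to fetch on a miss.)

import memcache

class ReadThroughCache(object):
    def __init__(self, servers, loader, ttl=300):
        self.mc = memcache.Client(servers)
        self.loader = loader          # callback invoked on a cache miss
        self.ttl = ttl

    def get(self, key):
        value = self.mc.get(key)
        if value is None:
            value = self.loader(key)  # fetch from the real storage
            self.mc.set(key, value, time=self.ttl)
        return value

# usage: the application no longer cares what storage sits behind the cache
# cache = ReadThroughCache(['127.0.0.1:11211'], loader=load_user_from_mysql)
# user = cache.get('user:42')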

For AppEngine/SimpleDB: I believe either CouchDB, or one of us simply starting to implement an installable “cloud”-like service, will be able to keep people from being tied to one of the big services.

Cheers,
-Brian

mike

re: online backup services

I’m guessing it was marked as spam, but my RSS reader caught it.

I recommend offsite backup solutions, but that’s AFTER the data is backed up locally. We still need an easy way to get that backup done locally, -then- we can shop around for services.

Die spammers 🙂

Peter Romianowski

Peter + mike,

the HSCALE project is not designed to be a “plug in and forget” sharding solution. It is clearly meant to provide a robust and fast system (as fast as you can get with the MySQL Proxy approach – think of the added network latency). So we concentrate on a particular use case – having partitions based on a single column of a table. That matches our database design – and perhaps others’. 😉
In the end you will have to rewrite parts of your application / SQL, but you don’t have to write a whole partition-aware layer into your application.

Currently we have our application running with HSCALE in our testing environment. (MySQL Proxy is really awesome, fast and robust btw.).

I think (and hope) that all of this will be more than just a neat proof of concept 😉 But I agree with you, Peter, there is a lot of work to do (as with every sharding solution).

Peter

Brian Aker

Hi Peter!

A lot of the 5% tuning stuff today should go away. Build applications for specific areas, but don’t worry about the incremental tuning bits. I would rather describe best practices and explain how to deploy across multiple servers.

We do seem to be entering an age where low hanging fruit can just run in environments, while the rest is infrastructure work that has to be deployed.

Cheers,
-Brian

Clint Byrum

Brian, that’s funny – I was just discussing yesterday how awesome it would be if there were something like memcached, only write-through rather than write-back. I will have to keep tabs on your progress. 🙂

I think the early geeky, shoestring-budget startups had a lot of systems-oriented people working on things. Yahoo’s and Google’s look and feel are sort of reminiscent of that. You had guys who were able to understand the impact of everything they were doing on their systems, but didn’t really know what average users wanted or thought. You had your CEO writing code and making decisions like “users won’t care if their posts take 5 minutes to show up”.

Now everybody is on the net so you have users demanding easier to use, more feature rich and robust applications. You have product managers making feature decisions, and developers with no stake in companies implementing them without knowledge of costs. So scalability needs to be transparent or it won’t really happen at all in that sort of environment.

Andrew

I spend a lot of time writing Python that checks memcached and then does a MySQL query if the data is not found. I also have to make sure that every update expires the memcached cache.

I think that memcached is a great technology, but it just doesn’t seem right. I shouldn’t have to write all of this extra application code to get more performance from my database.

Like many Web 2.0 sites I have a very limited number of queries so I don’t need (or want) a sophisticated database, I just want it to be very, very fast!

mike

I find a properly designed (from the start) site can easily implement memcached without much effort.

Proper get/set type functions for everything you cache, ensure all data access/modification is run through them, and you’re golden.

Brian Aker

Peter, per your comment to Mike… use a UDF against tables to remove stale data from the cache. AKA… force memcached to purge its data based on row updates in the database. If you have designed your site well, then you will have a cache miss in your cache which will cause a rebuild of the object from the database.

Cheers,
-Brian

Brian Aker

Hi Peter!

Put a trigger on that table such that select memc_delete(b.somecol) from table a, b WHERE a.id = b.id etc.. just hook the dependency with a trigger to “touch” the rows in the other table.

Cheers,
-Brian

mike

This is how I manage it right now.

To build collections, I select only the IDs, based on filtering, sorting/ordering, LIMIT, etc. I build an array based on that. I go directly to the database for that.

while ($row = mysql_fetch_assoc($result)) {
    $fetch[] = $row['id'];   // collect only the IDs
}

Then feed $fetch to a cache_get() function, which is basically a generic wrapper for memcache_get().

I can’t speak for how massively this scales, but it allows us to go to the database only for ordering and other lower (or higher?) level functions.

I will say that the memcache UDF is a cool idea, and would probably make this additional step no longer required, but that still requires a connection through MySQL to the memcached layer. I like keeping MySQL out of the loop as much as possible, to save it the more complex work… the above example should save a lot of data transfer and hopefully keep the pipe on the server open for more requests. I suppose just using the memcache UDF for cache invalidation/reset, set, and the first get (to warm that cache key) would help, though.

mike

Understood.

The idea was that if we build the list of IDs, then we can call generic accessor functions to get the data, and that is an abstraction layer in itself. It will check the cache, hit the db if needed, etc. – the consumer is none the wiser.

I also like this approach because it can enable sharding easily by adding one more check at the top of the function to determine the appropriate server/location of that data. Then use memcache to glue it together in the cache layer so next time it doesn’t require hitting any of the shards. Hope that makes sense.
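(A minimal sketch of that idea in Python – the shard map and the modulo routing are hypothetical, just to show the “one more check at the top of the accessor”:)

import MySQLdb

# hypothetical shard map: user id modulo shard count picks the server
SHARDS = [
    {'host': 'db1.example.com', 'db': 'app'},
    {'host': 'db2.example.com', 'db': 'app'},
]

def get_shard_connection(user_id):
    cfg = SHARDS[user_id % len(SHARDS)]
    return MySQLdb.connect(host=cfg['host'], user='app',
                           passwd='secret', db=cfg['db'])

def get_user(user_id):
    # route to the right shard first, then query as usual
    # (the memcache lookup in front of this is omitted for brevity)
    db = get_shard_connection(user_id)
    cur = db.cursor()
    cur.execute('SELECT id, name, email FROM users WHERE id = %s', (user_id,))
    return cur.fetchone()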

I assumed that the foundation inside the generic functions may change, but the generic functions themselves can be called the same way in the app; the only changes would need to be in the accessor functions for the new data storage architecture… which could be sharding, additional forms of caching, a swap-out to another RDBMS, or a move to something like a web service or a fully SOA framework of clusters of servers, each owning specific services and piecing them together on each request (much like I believe Amazon’s e-commerce system works).

mike

Side note: just found this link on memcached FAQ, and it’s exactly what I mentioned:
http://lists.danga.com/pipermail/memcached/2007-July/004578.html

Two-phase fetch – get IDs first (to build the list), then get the rest later
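(A rough Python sketch of that two-phase pattern, reusing the hypothetical users table from earlier – phase one lets MySQL do only the filtering and ordering, phase two multi-gets the objects from memcached and backfills any misses:)

import MySQLdb
import memcache

mc = memcache.Client(['127.0.0.1:11211'])
db = MySQLdb.connect(host='localhost', user='app', passwd='secret', db='app')

def fetch_recent_users(limit=20):
    # phase 1: the database returns only the IDs, in the right order
    cur = db.cursor()
    cur.execute('SELECT id FROM users ORDER BY created DESC LIMIT %s', (limit,))
    ids = [row[0] for row in cur.fetchall()]

    # phase 2: bulk-fetch the objects from the cache in one round trip
    keys = ['user:%d' % i for i in ids]
    cached = mc.get_multi(keys)

    users = []
    for user_id, key in zip(ids, keys):
        user = cached.get(key)
        if user is None:
            # backfill a miss from the database and warm the cache
            cur.execute('SELECT id, name, email FROM users WHERE id = %s',
                        (user_id,))
            user = cur.fetchone()
            mc.set(key, user, time=300)
        users.append(user)
    return users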

roj

Have you folks seen Repcached? Looks really neat – multi-master memcached replication.

http://repcached.sourceforge.net/

Thoughts on this?

Vlad

CouchDB looks like a great thing. One can even eliminate the web server by putting the whole web GUI in JavaScript.