October 30, 2006

Using Sphinx as MySQL data retrieval accelerator

Posted by peter

I’ve run into the following thread couple of days ago:

Basically someone is using sphinx to perform search simply on attributes (date, group etc) and get sorted result set and claiming it is way faster than getting it with MySQL. Honestly I can well believe it for cases when you want to know number of matching rows as well as if you can’t build efficient indexes so selectivity is done by index and index used to resolve order by.

Funny enough to filter by attributes or sort sphinx does not use indexes - indexes are only used for full text search matching, but it is still extremely fast doing data crunching.

I just tested right now performing search of “the” which matched 100.000.000 of documents out of 200.000.000 collection (200GB) completed in 0.7 second. This is system we’re building for one of our clients which uses cluster of 3 nodes to handle search. In this case no shortcuts are taken all 100.000.000 of matching document are traversed and priority queue sorting is performed to generate 1.000 best matching results. Quite impressive

Yeah I know it should be stop word but I currently have index without stop words for testing purposes.

Now what I’m hoping for as developments:

Andrew to continue improving sphinx so it would have more advanced filtering clauses and types of attributes, plus there would be an option to retrieve by filters only with no full text query. Sphinx should not be replacement for Database Server but for many data retrieval needs it will work great. Especially as it can be used with other databases which may be slower than MySQL.

MySQL Make it so one would not need to use sphinx to get great performance for such kind of queries. This includes parallel processing, fast count(*) and bitmap indexes to help non selective clauses. Also some form of fast sort like priority queue could be used if only few first elements are needed.

October 28, 2006

Wishes for new “Pure PHP” MySQL driver

Posted by peter

If you’re following MySQL or PHP landscape you should have seen announcement by MySQL to develop pure PHP driver. If not - Here is FAQ .

I’m to meet the team (Georg, Andrey etc) which will be developing this driver during my visit to Open Source Database Conference in November so I thought it would be good idea to gather some wish list for things nice to have in this new driver. Below is my list and I would appreciate to hear your ideas.

Build In Profiling I would like to see how many connections and queries page generated and how long they took - I will place this information in the log. For debugging I’d like to be able to get a table below page output listing all the queries, their exec times and number of rows they are returned. So far it has to be done in inherited class.

Auto Explain Would be very helpful for development to run EXPLAIN on the queries and warn if these is something obviously wrong.

Emulated Prepared Statements I would like to have easy time to switch from PS to standard statements to check bugs and performance difference without changing my application. Also there are bunch of statements which can’t be prepared - I would like library to take care of emulating these for me.

Arrays in Prepared Statements Dealing with certain types of queries, ie queries with IN is so painful with prepared statements - I would like to be able to simply able to pass data array and let library to take care about emulating.

Cached Prepared Statements It would be great if library could cache them so you do not have to worry of tracking same statements in your application and caching them.

Persistent Connections Ajax applications with frequent case of one query per request make persistent connections relevant again.

Automatic load balancing and fall back I would like to be able to pass list of servers system can connect to and let it handle load balancing and fall back by itself, similar as JDBC driver does.

Client size caching Would be very helpful for simple applications which may run query several times per page. I however would like to go further and have TTL based client cache with dynamic back end. I would expect many people would like to see memcached hooked up to it to share result sets among web servers for result sets which can be a bit stale. This might not be the most optimal way to cache things but it would be very easy for many applications.

Query Timeouts Make it easy to specify timeouts for connects and query execution so page would not timeout but you would rather be able to respond with “system overloaded try again later” message.

Exceptions This should be optional as not everyone likes them but would be very helpful in many cases.

Multiple concurrent queries I would like to be able to run multiple queries for multiple MySQL connections at once. For scale out applications data to generate the page may be on many different servers and it is latency killer to be required to query all of them sequentially instead of doing it in parallel. Now people have to do crazy things to work it around.

There are just few thing from the top of my head. There are probably some I forgot about at this point :)

October 26, 2006

Speaking on OpenSource Database Conference, Frankfurt

Posted by peter

I’ll have two sessions on upcoming OpenSource Database Conference in Frankfurt 6-8 November. One session will be general MySQL Performance Optimization workshop the other will be focused on Innodb architecture and optimization.

If you’re visiting this event or International PHP Conference which runs parallel to this even drop me a note and we can chat.

You also might noticed I was not posting too actively in October - it turned to be very busy month but hopefully I’ll get some more time soon :)

October 16, 2006

Should MySQL and Web Server share the same box ?

Posted by peter

This is interesting question which I thought it would be good to write about. There are obviously benefits and drawbacks for each of methods.

Smaller applications usually start with single server which has both MySQL and Web server on it. In this case it is not usually the question but once application growths larger and you need to have multiple servers you may decide ether to grow system in MySQL+Apache pairs or split MySQL And Web Server and place them on different boxes.

Generally using separate boxes for MySQL and Web Servers is rather good practice.

It is more secure - Compromising your web server does not directly give access to your database, even though most applications have enough database access permissions to be allow intruder to trash/dump data.

It is easier to analyze - Troubleshooting bottlenecks on shared boxes is more complicated compared to systems running only MySQL or only Web server. In this case you already know who is troublemaker by simply looking at system wide stats.

Easier to maintain - Same thing if box happens to run multiple things it is harder to maintain. I would not call the difference significant in this case though.

Easier to balance - Lets say you have Web application and just added some new feature, ie chat application which increases load on your web server but does not really affect database part of load. If you can operate database farm and web server farms separately you can simply increase number of web servers.

It is less expensive - You typically want database boxes to be secure, using good hardware with ECC memory to avoid database corruption, use RAID to avoid loosing database with any hard drive loss etc. Database boxes also generally require more monitoring and maintainence such as backups so you end up using some serious hardware for this boxes to keep their
number manageable. With Web boxes it is different - you’re quite OK using crappy hardware for them as all you need is CPU power. If box starts to misbehave it is easy to shut it down without affecting site operations. Also you rarely would have data corruption due to web boxes memory failure, more likely you’ll have web server crashes and this sort of things. You can ether clone web servers from template hard drive or even have them disk less booting by NFS.

So if using dedicated boxes is so great why to think about sharing MySQL and Web server at all ? Well mostly it is for cheap guys.
In many applications you will find database servers to be IO bound so CPUs are doing virtually nothing and you’re wasting resources. This is the reason for some cheap environments to have Web servers also on database boxes, might be only handling partial load etc.

I would however only use it in last resort - placing some data crunching scripts on database server is often better use of their free CPU time.

Second thing you may feel bad about it is Web Servers memory. Getting certain amount of memory is pretty cheap, ie 4GB of memory per box costs very close to 2GB, while jump from 16GB to 32GB may be much more expensive (even in price by GB).
So you can get Web boxes with relatively plenty of memory cheap but unless you’re running 500 Apache children with mod_P (php,perl,python) per box (which is probably bad idea anyway).

The good use for such extra memory is probably caching - Web page caching, if you do not have separate layer for it, local memory or cacheing type of caching (depending on your application needs) is very good idea.

One more benefit of local access to MySQL is latency. This was problem many years ago with 10Mbit network but with 1Gbit networks being commodity these days you should not worry too much about it, unless you have each page generated by 1000+ queries, which is bad idea already.

One case I should mention when shared MySQL and Web server makes sense is Web Services architecture when you can have certain boxes providing you with some simple “Services” - these could be small enough to be single shared box (or pair of shared boxes for HA). In such cases I would think about Web Server mainly being provider of different protocol to access your data - it is typically simple and would not require much of CPU and other resources itself.

For example you can see this “Shared” kind of architecture in CNET systems using ATOMICS component to talk to MySQL over HTTP. (not what I’m great fan of this idea though).

October 8, 2006

Small things are better

Posted by peter

Yesterday I had fun time repairing 1.5Tb ext3 partition, containing many millions of files. Of course it should have never happened - this was decent PowerEdge 2850 box with RAID volume, ECC memory and reliable CentOS 4.4 distribution but still it did. We had “journal failed” message in kernel log and filesystem needed to be checked and repaired even though it is journaling file system which should not need checks in normal use, even in case of power failures. Checking and repairing took many hours especially as automatic check on boot failed and had to be manually restarted.

Same may happen with Innodb tables. They are designed to never crash, surviving power failures and even partial page writes but still they can get corrupted because of MySQL bugs, OS Bugs or hardware bugs, misconfiguration or failures.
[read more...]

October 3, 2006

MySQL Optimizer and Innodb Primary Key

Posted by peter

Innodb primary key is special in many senses and I was always wondering how well MySQL is integrated with Innodb to take advantage of these special features.

Lets see which things work and which things do not:
[read more...]

Long PRIMARY KEY for Innodb tables

Posted by peter

I’ve written and spoke a lot about using short PRIMARY KEYs with Innodb tables due to the fact all other key will refer to the rows by primary key. I also recommended to use sequential primary keys so you do not end up having random primary key BTREE updates which can be very expensive.

Today I would like to share practical example when you may use long primary key with value distribution far from sequential.
[read more...]

October 1, 2006

Backport of micro-time patch to mysql 4.1

Posted by Vadim

Taking into account 4.1 tree is still popular and is used on many production servers we
decided to make backport of patch to slow-log queries. The patch allows to specify time of slow queries in microseconds and is very helpful in a fight with problematic queries. 4.1 Patch is available here (The original patch was developed by Georg Richter georg at php net)