If you’ve been reading enough database-related forums, mailing lists, or blogs, you have probably heard complaints from some users about MySQL being unable to handle more than 1,000,000 (or pick any other number) rows. On the other hand, it is well known that with customers like Google, Yahoo, LiveJournal, and Technorati, MySQL has installations with many billions of rows and delivers great performance. What could be the reason?

The reason is normally table design and understanding MySQL’s inner workings. If you design your data wisely, considering what MySQL can do and what it can’t, you will get great performance. And if not, you might become upset and become one of those bloggers. Note that every database management system is different in some respects, and what works well for Oracle, MS SQL, or PostgreSQL may not work well for MySQL, and vice versa. Even storage engines have very important differences which can affect performance dramatically.

The three main issues you should be concerned with when dealing with very large data sets are buffers, indexes, and joins.

Buffers

The first thing you need to take into account is that a situation where data fits in memory and one where it does not are very different. If you started with an in-memory data size and expect a gradual performance decrease as the database grows, you may be surprised by a severe drop in performance. This especially applies to index lookups and joins, which we cover later. As everything usually slows down a lot once it does not fit in memory, a good solution is to make sure your data fits in memory as well as possible. This can be done by data partitioning (e.g., old and rarely accessed data stored on different servers), multi-server partitioning to use combined memory, and a lot of other techniques which I should cover at some later time.
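
As a rough way to check where you stand, you can compare table sizes from INFORMATION_SCHEMA (available in MySQL 5.0+) with your buffer settings – key_buffer_size for MyISAM indexes, innodb_buffer_pool_size for InnoDB. A minimal sketch; the schema name “mydb” is just a placeholder:

-- Approximate data and index size per table (schema name is a placeholder)
SELECT TABLE_NAME,
       ROUND(DATA_LENGTH  / 1024 / 1024) AS data_mb,
       ROUND(INDEX_LENGTH / 1024 / 1024) AS index_mb
FROM   INFORMATION_SCHEMA.TABLES
WHERE  TABLE_SCHEMA = 'mydb';

-- Compare the totals against your memory settings
SHOW VARIABLES LIKE 'key_buffer_size';
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';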

So that you understand how much having data in memory changes things, here is a small example with numbers. If you have your data fully in memory, you can perform over 300,000 random lookups per second from a single thread, depending on the system and table structure. Now if your data is fully on disk (both data and index), you need 2+ IOs to retrieve a row, which means you get about 100 rows/sec. Note that multiple drives do not really help a lot here, as we’re speaking about a single thread/query. So the difference is 3,000x! It might be a bit of an exaggeration, as there are few completely uncached workloads, but a 100+ times difference is quite frequent.
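
To make the arithmetic explicit – assuming roughly 5 ms per random IO and 2 IOs per row, which are illustrative figures rather than measurements – the numbers work out like this:

-- Back-of-the-envelope check of the 3,000x figure
SELECT 1000 / (2 * 5)            AS disk_rows_per_sec,    -- ~100 rows/sec when every row costs 2 random IOs
       300000 / (1000 / (2 * 5)) AS in_memory_advantage;  -- ~3,000x vs 300,000 lookups/sec in memory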

Indexes

What everyone knows about indexes is that they are good for speeding up access to data. Some people also remember that whether an index is helpful or not depends on index selectivity – how large a proportion of rows match a particular index value or range. What is often forgotten is that, depending on whether the workload is cached or not, a different selectivity is needed to benefit from using an index. In fact, even the MySQL optimizer currently does not take this into account. For in-memory workloads, index access might be faster even if 50% of rows are accessed, while for disk-IO-bound access we might be better off doing a full table scan even if only a few percent of rows are accessed.
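
If you want to see where your query stands, you can measure the selectivity of a range yourself and check what the optimizer decided. This is just a sketch – the table “t”, the “val” column, and the range are made up:

-- What fraction of rows does the range actually match?
SELECT (SELECT COUNT(*) FROM t WHERE val BETWEEN 1 AND 100)
       / (SELECT COUNT(*) FROM t) AS fraction_matched;

-- EXPLAIN shows whether the optimizer chose the index or a full table scan
EXPLAIN SELECT * FROM t WHERE val BETWEEN 1 AND 100;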

Let’s do some computations again. Consider a table which has 100-byte rows. With decent SCSI drives, we can get 100MB/sec read speed, which gives us about 1,000,000 rows per second for fully sequential access with jam-packed rows – quite possibly a scenario for MyISAM tables. Now if we take the same hard drive for a fully IO-bound workload, it will be able to provide just about 100 row lookups by index per second. The difference is 10,000 times for our worst-case scenario. It might not be that bad in practice, but again, it is not hard to reach a 100 times difference.
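
The same kind of back-of-the-envelope check, with the same caveat that the figures are assumptions:

-- 100 MB/sec sequential reads over 100-byte rows vs ~100 random index lookups/sec
SELECT 100 * 1000 * 1000 / 100       AS sequential_rows_per_sec,  -- ~1,000,000 rows/sec
       100 * 1000 * 1000 / 100 / 100 AS worst_case_ratio;         -- ~10,000x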

Here is a little illustration I’ve created using a table with over 30 million rows. The “val” column in this table has 10,000 distinct values, so the range 1..100 selects about 1% of the table. Compare the times for a full table scan vs. a range scan by the index.
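
The comparison boils down to something like the two queries below. The table, index, and column names are illustrative, not the exact ones from my test; counting a non-indexed column forces both variants to read the actual rows:

-- Full table scan vs. range scan by index over the same 1% of rows
SELECT COUNT(other_col) FROM tbl IGNORE INDEX (val) WHERE val BETWEEN 1 AND 100;  -- forces the full table scan
SELECT COUNT(other_col) FROM tbl FORCE INDEX (val)  WHERE val BETWEEN 1 AND 100;  -- forces the range scan by index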

Also, remember – not all indexes are created equal. Some indexes may be placed in sorted order, while others have their pages placed in random locations – this may affect index scan/range scan speed dramatically. The rows referenced by an index may also be located sequentially or require random IO if index ranges are scanned. There are also clustered keys in InnoDB, which combine index access with data access, saving you IO for completely disk-bound workloads.
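
For example – a hedged sketch with made-up names – an InnoDB table clustered by (user_id, message_id) stores all of a user’s messages physically together, so the range below is read along with the index instead of requiring a separate random IO per row:

CREATE TABLE messages (
  user_id    INT UNSIGNED NOT NULL,
  message_id INT UNSIGNED NOT NULL,
  body       TEXT,
  PRIMARY KEY (user_id, message_id)  -- InnoDB clusters the row data by this key
) ENGINE=InnoDB;

SELECT body FROM messages WHERE user_id = 42 ORDER BY message_id;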

There are certain optimizations in the works which would improve the performance of index accesses/index scans. For example, retrieving index values first and then accessing rows in sorted order can be a lot of help for big scans. This will reduce the gap, but I doubt it will be closed.

Joins

Joins are used to compose a complex object which was previously normalized into several tables, or to perform complex queries finding relationships between objects. Normalized structure and a lot of joins is the right way to design your database, as the textbooks teach you, but when dealing with large data sets it can be a recipe for disaster. The problem is not the data size – normalized data is normally smaller – but the dramatically increased number of index lookups, which may be random accesses. This problem exists for all kinds of applications; however, for OLTP applications with queries examining only a few rows, it is less of a problem. Data retrieval, search, DSS, and business intelligence applications, which need to analyze a lot of rows, run aggregates, etc., are where this problem is most dramatic.

Some joins are also better than others. For example, if you have a star join with dimension tables being small, it would not slow things down too much. On the other hand, a join of a few large tables, which is completely disk-bound, can be very slow.

One of the reasons this problem is magnified in MySQL is the lack of advanced join methods at this point (the work is on the way) – MySQL can’t do a hash join or a sort-merge join – it can only do the nested loops method, which requires a lot of index lookups which may be random.
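
Conceptually, a join such as the one below (table and column names are made up) is executed by scanning the driving table and performing one index lookup into the second table per matching row – each of which can be a random IO on a large table:

SELECT a.id, b.payload
FROM   t_a AS a
JOIN   t_b AS b ON b.a_id = a.id
WHERE  a.created < '2006-01-01';
-- Roughly: for each row of t_a that matches the WHERE clause,
-- look up b.a_id = a.id via the index on t_b (potentially one seek per row)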

Here is a good example. As we saw, my 30-million-row (12GB) table was scanned in less than 5 minutes. Now if we were to do an equality join of this table to another 30-million-row table, the access would be completely random. We would need to perform 30 million random row reads, which gives us 300,000 seconds at a 100 rows/sec rate. So we would go from 5 minutes to almost 4 days if we need to do the join. Some people assume a join would be close to two full table scans (as 60 million rows need to be read) – but this is way off.
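
The arithmetic, with the same assumed 100 random reads/sec as before:

-- 30,000,000 random row reads at ~100 reads/sec, single thread
SELECT 30000000 / 100         AS seconds_needed,  -- 300,000 seconds
       30000000 / 100 / 86400 AS days_needed;     -- ~3.5 days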

Do not take this as me being against normalization or joins. They are great principles and should be used when possible. Just do not forget about the performance implications they design into the system, and do not expect joins to be free.

Finally, I should mention one more MySQL limitation which requires you to be extra careful when working with large data sets. In MySQL, a single query runs as a single thread (with the exception of MySQL Cluster), and MySQL issues IO requests one by one during query execution, which means that if single-query execution time is your concern, many hard drives and a large number of CPUs will not help. Sometimes it is a good idea to manually split the query into several queries run in parallel and aggregate the result sets.
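
For example – a sketch with made-up table and column names – a big aggregate can be split by ID range, each piece run on its own connection at the same time, and the partial results combined in the application:

-- Run each of these on a separate connection, concurrently:
SELECT SUM(amount) FROM orders WHERE id BETWEEN        1 AND 10000000;
SELECT SUM(amount) FROM orders WHERE id BETWEEN 10000001 AND 20000000;
SELECT SUM(amount) FROM orders WHERE id BETWEEN 20000001 AND 30000000;
-- Then add the three partial sums together in the application code.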

So if you’re dealing with large data sets and complex queries, here are a few tips.

Try to fit the data set you’re working with in memory – Processing in memory is so much faster, and you have a whole bunch of problems solved just by doing so. Use multiple servers to host portions of the data set. Store a portion of the data you’re going to work with in temporary tables, etc.
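
For instance – again a sketch with made-up names – you can pull just the slice a report needs into a temporary in-memory table and run the heavy queries against that:

CREATE TEMPORARY TABLE recent_orders ENGINE=MEMORY  -- size limited by max_heap_table_size
  SELECT id, customer_id, amount
  FROM   orders
  WHERE  order_date >= '2006-01-01';

SELECT customer_id, SUM(amount)
FROM   recent_orders
GROUP  BY customer_id;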

Prefer full table scans to index accesses – For large data sets, full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% of rows or less, a full table scan may be faster.

Avoid joins to large tables – Joining large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables are OK, but you might want to preload them into memory before the join so there is no random IO needed to populate the caches.
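
For MyISAM you can preload a small table’s indexes into the key cache explicitly, and a cheap pass over the data warms the OS cache (or the InnoDB buffer pool). A sketch, assuming a small dimension table named dim_country with a “name” column:

LOAD INDEX INTO CACHE dim_country;           -- MyISAM only: preload the indexes into the key cache
SELECT SUM(LENGTH(name)) FROM dim_country;   -- forces the data rows to be read, warming the caches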

With proper application architecture and table design, you can build applications operating with very large data sets based on MySQL.


144 Comments
Ian Holsman

Good post Peter..

just a couple of questions to clarify somethings.

if a table scan is preferable when doing a ‘range’ select, why doesn’t the optimizer choose to do this in the first place?

is there some sort of rule of thumb here.. use an index when you expect your queries to only return X% of the data back? otherwise put a hint in your SQL to force a table scan?

what changes are in 5.1 which change how the optimizer parses queries..

does running optimize table regularly help in these situations? or just when you have a large change in your data distribution in your table?

oh.. one tip for your readers.. always run ‘explain’ on a fully loaded database to make sure your indexes are being used

Vadim Tkachenko

Ian,
as I wrote in http://www.mysqlperformanceblog.com/2006/06/02/indexes-in-mysql/
the MySQL optimizer calculates logical I/O for index access and for a table scan.
One big mistake here, I think, is that MySQL assumes 100 key comparisons
(like ” if (searched_key == current_key)”) are equal to 1 logical I/O.
But I believe on modern boxes the constant 100 should be much bigger.

Alexey

For most workloads you’ll always want to provide enough memory to key cache so its hit ratio is like 99.9%. Even if a table scan looks faster than index access on a cold-cache benchmark, it doesn’t mean that it’s a good idea to use table scans. It can easily hurt overall system performance – by trashing OS disk cache, and if we compare table scan on data cached by OS and index scan on keys cached by MySQL, table scan uses more CPU (because of syscall overhead and possible context switches due to syscalls).
Just an opinion. 🙂

Aaron

We’ve got 20,000,000 bank loan records we query against all sorts of tables. Everything is real real slow. We’re using LAMP. I could send the table structures and queries/PHP code that tends to bog down.

azmi

20m records is not so big compared to a social media database, which has almost 24/7 traffic – selects, inserts, updates, deletes, sorts… every nanosecond or even less. You need a database expert to tune your database engine to suit your needs, server specs, RAM, HDD, etc.

Paul

I implemented a simple logging of all my web sites access to make some statistics (sites access per day, ip address, search engine source, search queries, user text entries, …) but most of my queries went way too slow to be of any use last year. The problem started when I got to around 600,000 rows (table size: 290MB).

I tried a few things like optimize and putting an index on all columns used in any of my queries, but it did not help that much since the table is still growing… I guess I may have to replicate it to another standalone PC to run some tests without killing my server CPU/IO every time I run a query.

I am not using any join, I will try the ‘explain’ and the ‘IGNORE INDEX()’ when I have a chance although I don’t think it will help since I added indexes after I saw the problem.

The index does make it very fast for one of my tables on another project (list of all cities in the world: 3 million rows). I guess it’s all about memory vs hard disk access.

Now my question is for a current project that I am developing. You can think of it as a webmail service like google mail, yahoo or hotmail.

Right now I am wondering if it would be faster to have one table per user for messages instead of one big table with all the messages and two indexes (sender id, recipient id). This could mean millions of tables, so it is not easy to test.

I will probably write a random users/messages generator to create a million users with a thousand messages each to test it, but you may already have some information on this, so it may save me a few days of guesswork.

Any suggestions would be appreciated.

Apachez

8. peter: Please (if possible) keep the results public (like in this blog thread or create a new blog thread) since the findings might be interesting for others to learn what to avoid and what the problem was in this case.

11. peter: However, with one table per user you might run out of file descriptors (open_tables limit), which should be taken into consideration for designs where you would like to have “one table per user”.

Martin Keller

I have a table with 35 mil records. When I wanted to add a column (alter table) it would take about 2 days. Writing my own program in
C# that prepared a file for import shortened this task to about 4 hours.

Daniel

I’m just dealing with the same issue with a message system.

At the moment I have one table (MyISAM/MySQL 4.1) for users’ inboxes and one for all users’ sent items.
Now the inbox table holds about 1 million rows with nearly 1 gigabyte total. The
sent items table is about half that.
At this point it is working well with over 700 concurrent users.
Now I’m doing a recode and there should be a lot more functions, like own folders etc., and the queries will be a lot more complex.
That’s why I’m now thinking about useful possibilities for designing the message table and about what’s the best solution for the future.

Is it really useful to have a separate message table for every user?

Thanks a lot for any suggestions.

System: It’s now on a 2x dual-core Opteron with 4GB RAM / Debian / Apache2 / MySQL 4.1 / PHP4 / SATA RAID1.
In the near future I will have Apache on a dedicated machine and the MySQL server too (and the next step will be a master/slave server setup for the database).

Sergey

Peter,

I think you can give me some advice. I am running MySQL 4.1 on Red Hat Linux. The box has 2GB of RAM and dual 2.8GHz Xeon processors, and the /etc/my.cnf file looks like this.


[mysqld]
...
key_buffer = 512M
max_allowed_packet = 8M
table_cache = 512
sort_buffer_size = 32M
read_buffer_size = 32M
read_rnd_buffer_size = 128M
myisam_sort_buffer_size = 256M
thread_cache = 32
query_cache_size = 256M

I am running a data mining process that updates/inserts rows in the table (i.e. I do a multi-field select on indexed fields, and if a row is found, I update the data; if not, I insert a new row). Up to about 15,000,000 rows (1.4GB of data) the procedure was quite fast (500-1000 rows per second), and then it started to slow down. Now I have about 75,000,000 rows (7GB of data) and I am getting about 30-40 rows per second.

D.ZHU

Hi peter,

I’m doing a coding project that would result in massive amounts of data (it will reach somewhere like 9 billion rows within 1 year). Supposing I’m completely optimized…

* If I run a ‘select from where…’ query, how long is the query likely to take?
* Also, how long would an insert take?

I see you have, in the example above, 30 million rows of data, and a select took 29 minutes!

* And how would I estimate such performance figures?

Labus

Normally MySQL is rather fast at loading data into a MyISAM table, but there is an exception: when it can’t rebuild indexes by sort but builds them
row by row instead. That can happen due to wrong configuration (i.e. too small myisam_max_sort_file_size or myisam_max_extra_sort_file_size) or
it could be just lack of optimization if you have large (do not fit in memory) PRIMARY or UNIQUE indexes.

Below is the internal letter I’ve sent out on this subject which I guessed would be good to share

Today on my play box I tried to load data into MyISAM table (which was
previously dumped as mysqldump –tab)

The data was some 1.3G, 15.000.000 rows, 512MB memory on the box.
MySQL 4.1.8

One could call it a trivially fast task; unfortunately, I had a
unique key on varchar(128) as part of the schema.

The load took some 3 hours before I aborted it finding out it was just
about 20% done.

The problem is – unique keys are always rebuilt using key_cache, which
means we’re down to some 100-200 rows/sec as soon as index becomes
significantly larger than memory.

You can’t get away with ALTER TABLE DISABLE KEYS, as it does not affect
unique keys.

Len

Google may use MySQL but they don’t necessarily have billions of rows – just because Google uses MySQL doesn’t mean they actually use it for their search engine results. They have many little sections on their website, you know. Laughably, they even used PHP for one project. Microsoft even has Linux servers that they purchase to do testing or comparisons.

Jay Pipes

@Len: not quite sure what you’re getting at…other than being obtuse.

If you’d like to know how and what Google uses MySQL for (yes, AdSense, among other things), come to the Users Conference in April (http://mysqlconf.com). 4 Googlers are speaking there, as is Peter. Also, I don’t understand your aversion to PHP… what about using PHP is laughable? Sounds to me you are just flame-baiting…

Jasper

I have made an online dictionary using a MySQL query I found online. It has been working pretty well until today. Now the page loads quite slowly. I wonder how I can optimize my table. The things you wrote here are kind of difficult for me to follow.
My table has 545 rows, an id column, a Dutch words column, Thai characters, phonetics, is dynamic, has 59 bytes per row, auto-indexes, has a collation: utf8 unicode, data: 25,820 bytes, index: 6,144 bytes, and total: 31,964 bytes of data. Should I split up the data to load it faster or use a different structure? Thanks for your suggestions.

Bart R

Hello peter,

I have a table with a unique key on two columns (STRING, URL). The table contains 36 million rows (Data size 5GB, Index size 4GB).

I run the following query, which takes 93 seconds !

SELECT TITLE FROM GRID WHERE STRING = 'sport';

When I run the query below, it only takes 0.1 seconds :

SELECT COUNT(*) FROM GRID WHERE STRING = 'sport';

So while the where-clause is the same, the first query takes much more time.
The slow part of the query is thus the retrieving of the data. I came to this
conclusion also because the query took longer the more rows were retrieved.

Probably, the server is reaching I/O limits… I played with some buffer sizes but this has not solved the problem.. Does anyone have experience with tables this large? Is MySQL able to handle tables (MyISAM) this large?

Also, is it an option to split this big table in 10 smaller tables ?

CREATE TABLE GRID (
ID bigint(20) NOT NULL auto_increment,
STRING varchar(100) character set utf8 collate utf8_unicode_ci NOT NULL default '',
URL varchar(230) character set utf8 collate utf8_unicode_ci NOT NULL default '',
LANGUAGE char(2) NOT NULL default 'EN',
COUNTRY char(2) NOT NULL,
TITLE varchar(255) character set utf8 collate utf8_unicode_ci NOT NULL default '',
DESCRIPTION text character set utf8 collate utf8_unicode_ci,
POINTS decimal(10,2) NOT NULL default '0.00',
PRIMARY KEY (ID),
UNIQUE KEY string (STRING,URL)
);

Nick

Peter,
I have a similar situation to the message system, only my data set would be even bigger. I have several data sets and each of them would be around 90,000,000 records, but each record has just a pair of IDs as a composite primary key and a text – just 3 fields.
The number of IDs would be between 15,000 ~ 30,000 depending on the data set. I have tried setting up one big table for one data set; the query is very slow, takes up like an hour, where ideally I would need a few seconds. But if I do tables based on IDs, that would not only create so many tables, but also have duplicated records since information is shared between 2 IDs.
What would be the best way to do it?

Jacob

Hello,

I am building a statistics app that will house 9-12 billion rows. I am trying to use MySQL Cluster, i.e. the ndbcluster engine. Hardware is not an issue; that is to say, I can get whatever hardware I need to do the job. Select times are reasonable, but insert times are very very very slow. As an example, in a basic config using MyISAM tables I am able to insert 1 million rows in about 1-2 min. The servers I am testing on have 2.4GHz Xeon CPUs with 1GB RAM and a gigabit network. However, with ndbcluster the exact same inserts are taking more than 15 min. Anyone have any ideas on how I can make this faster?

Martin

I’m working on a project which will need some tables with about 200-300 million rows.
Each row consists of 2x 64 bit integers.
The table structure is as follows:
CREATE TABLE z_chains_999 (
startingpoint bigint(8) unsigned NOT NULL,
endingpoint bigint(8) unsigned NOT NULL,
PRIMARY KEY (startingpoint,endingpoint)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED;

My problem is that the more rows are inserted, the longer it takes to insert additional rows.
I insert rows in batches of 1,000,000 rows.
The first insertion takes 10 seconds, the next takes 13 seconds, then 15, 18, 20, 23, 25, 27, etc. I fear what it will be like when it gets up to 200 million rows.
Removing the PRIMARY KEY stops this problem, but I NEED it.. Any suggestions what to do?
my key_buffer is set to 1000M, but this problem already begins long before the memory is full.

Martin

It gets slower and slower for every 1 million rows I insert. The first 1 million rows take 10 seconds to insert; after 30 million rows, it takes 90 seconds to insert 1 million more.
I need to do 2 queries on the table. 1st one (which is used the most) is “SELECT COUNT(*) FROM z_chains_999”, the second, which should only be used a few times is “SELECT * FROM z_chains_999 ORDER BY endingpoint ASC”

I found that setting delay_key_write to 1 on the table stops this from happening. Now it remains at a steady 12 seconds every time I insert 1 million rows. What exactly does this option do?
After 26 million rows with this option on, it suddenly takes 520 seconds to insert the next 1 million rows.. Any idea why? Maybe the memory is full?

Will

Peter, I just stumbled upon your blog by accident. Very good info!

I’m building an evaluation system with about 10-12 normalized tables. There are two main output tables that most of the querying will be done on. I filled the tables with 200,000 records and my query won’t even run. It’s losing connection to the db server. I tried SQL_BIG_RESULT, analyze table, etc… nothing seems to help

Here’s my query. It does a lot of counting. Does this look like a performance nightmare waiting to happen?

Thanks,
Will

SELECT
Q.questionID,
Q.questionsetID,
Q.question,
Q.questioncatid,
QAX.questionid,
QAX.answersetid,
ASets.answersetid,
ASets.answersetname,
ASAX.answersetid,
ASAX.answerid,
A.answerID,
A.answername,
A.answervalue,
COUNT(DISTINCT e3.evalanswerID) AS totalforthisquestion,
COUNT(DISTINCT e1.evalanswerID) AS totalforinstructor,
(COUNT(DISTINCT e3.evalanswerID)/COUNT(DISTINCT e1.evalanswerID)*100)
AS answerpercentage
FROM tblquestions Q
INNER JOIN tblquestionsanswers_x QAX USING (questionid)
INNER JOIN tblanswersets ASets USING (answersetid)
INNER JOIN tblanswersetsanswers_x ASAX USING (answersetid)
INNER JOIN tblanswers A USING (answerid)
LEFT JOIN (tblevalanswerresults e1 INNER JOIN tblevaluations e2 ON
e1.evalid = e2.evalid
AND e2.InstructorID = '1021338'

) ON e1.questionid = Q.questionID
LEFT JOIN (tblevalanswerresults e3 INNER JOIN tblevaluations e4 ON
e3.evalid = e4.evalid
AND e4.InstructorID = '1021338'

) ON e3.questionid = Q.questionID AND
e3.answerID = A.answerID

GROUP BY
Q.questionID,
Q.questionsetID,
Q.question,
ASets.answersetid,
Q.questioncatid,
A.answerID,
A.answername,
A.answervalue
HAVING Q.questioncatid = 1

UNION
/**The following query is just for the totals, and does not include the
group columns**/
SELECT
9999,
NULL,
'Totals',
Q.questionID,
Q.questionsetID,
Q.question,
Q.questioncatid,
QAX.questionid,
ASets.answersetid,
ASets.answersetname,
A.answerID,
A.answername,
A.answervalue,
COUNT(DISTINCT e3.evalanswerID) AS totalforthisquestion,
COUNT(DISTINCT e1.evalanswerID) AS totalforinstructor,
(COUNT(DISTINCT e3.evalanswerID)/COUNT(DISTINCT e1.evalanswerID)*100)
AS answerpercentage
FROM tblquestions Q
INNER JOIN tblquestionsanswers_x QAX USING (questionid)
INNER JOIN tblanswersets ASets USING (answersetid)
INNER JOIN tblanswersetsanswers_x ASAX USING (answersetid)
INNER JOIN tblanswers A USING (answerid)
LEFT JOIN (tblevalanswerresults e1 INNER JOIN tblevaluations e2 ON
e1.evalid = e2.evalid
AND e2.InstructorID = '1021338'

) ON e1.questionid = Q.questionID
LEFT JOIN (tblevalanswerresults e3 INNER JOIN tblevaluations e4 ON
e3.evalid = e4.evalid
AND e4.InstructorID = '1021338'

) ON e3.questionid = Q.questionID AND
e3.answerID = A.answerID

GROUP BY Q.questioncatid, ASets.answersetname,A.answerID,A.answername,A.answervalue

HAVING Q.questioncatid = 1

Adesh

I am having one query

SELECT DISTINCT spp.provider_profile_id, sp.provider_id, sp.business_name, spp.business_phone, spp.business_address1, spp.business_address2, spp.city, spp.region_id, spp.state_id, spp.rank_number, spp.zipcode, sp.sic1, sp.approved
FROM service_provider sp
INNER JOIN service_provider_profile spp ON sp.provider_id = spp.provider_id
WHERE sp.approved = 'Y'
AND spp.master_status = '0'
ORDER BY sp.business_name ASC
LIMIT 0 , 100

In all three tables there are more than 7 lakh (700,000) records. My query doesn’t work at all.
Any solution……….?
I am using PHP 5 and MySQL 4.1………

Andreas

Hi, I’m working professionally with PostgreSQL and MSSQL, and at home I’m using MySQL for my leisure projects .. Currently I’m working on a project with about 150,000 rows that need to be joined in different ways to get the datasets I want to present to the user.

In my profession I’m used to joining together all the data in the query (MSSQL) before presenting it to the client. Is this wise .. i.e.

In MSSQL the best performance, if you have a complex dataset, is to join 25 different tables rather than returning each one, getting the desired key and selecting from the next table using that key ..

But as I understand it, in MySQL it’s best not to join too much .. Is this correct ..

Fernando

Hello Guys
This article puzzles me a bit. I have a project I have to implement with open-source software. This is about a very large database, around 200,000 records, but with a TEXT FIELD that could be really huge… If I am looking for performance on the searches and the overall system, what would you recommend? (In terms of software and hardware configuration)

I am opting to use MySQL over PostgreSQL, but these articles about slow performance of MySQL on large databases surprise me…

By the way… on the other hand, does MySQL support XML fields?

andreas

Hi again. Indeed, this article is about common misconfigurations that people make .. including me .. I’m used to MS SQL Server, which out of the box is extremely fast .. But for my MySQL server I’m having performance issues, so my question remains: what is the best route, joins and complex queries, or several simple queries? What I’m asking for is what MySQL does best: lookups and indexes, and returning data.

Andreas Thörn

Hi. I guess I have to experiment a bit. Does anyone have a good newbie tutorial for configuring MySQL? My server isn’t the fastest in the world, so I was hoping to enhance performance by tweaking some parameters in the conf file, but as everybody knows, tweaking without any clue how different parameters work together isn’t a good idea ..

Jim

Hi, I have a table I am trying to query with 300K records which is not large relatively speaking. My problem is some of my queries take up to 5 minutes and I can’t seem to put my finger on the problem. I have tried indexes and that doesn’t seem to be the problem. My query is based on keywords. I get the keyword string then look up the id. I then use the id of the keyword to lookup the id of my record. I then build a SELECT query. My SELECT statement looks something like
SELECT * FROM table_name WHERE (year > 2001) AND (id = 345 OR id = 654 ….. OR id = 90)
The second set of parentheses could have 20k+ conditions. Is there another way to approach this?
I am running MYSQL 5.0

Will

Take the * out of your select, and name the columns you need. That should improve it somewhat.

Will

Jim

Will,
I’m not using an * in my actual statement
my actual statement looks more like
SELECT id FROM table_name WHERE (year > 2001) AND (id = 345 OR id = 654 ….. OR id = 90)

Even the count(*) takes over 5 minutes on some queries.

Jim

To answer my own question I seemed to find a solution.
I used the IN clause and it sped my query up considerably.
To use my example from above

SELECT id FROM table_name WHERE (year > 2001) AND (id IN( 345,654,…, 90))

Erick

This article is BS. MySQL sucks on big databases, period. The big sites such as Slashdot and so forth have to use massive clusters and replication. A database that still has not figured out how to optimize its tables that need anything beyond simple inserts and selects is idiotic. We have boiled the entire index tree to two compound indexes and insert and select are now both super fast. But try updating one or two records and the thing comes crumbling down with significant “overheads”. InnoDB is suggested as an alternative. The select speed on InnoDB is painful and requires huge hardware and memory to be meaningful. Basically: we’ve moved to PostgreSQL, which is a real database and with version 8.x is fantastic with speed as well.

Eric

Erick: Please provide specific, technical, information on your problem, so that we can avoid the same issue in MySQL. For example, how large were your MySQL tables, system specs, how slow were your queries, what were the results of your explains, etc. Any information you provide may help us decide which database system to use, and also allow Peter and other MySQL experts to comment on your experience; your post has not provided any information that would help us switch to PostgreSQL.
Eric

Erick

Ok, here are specifics from one system. It has exactly one table. MYISAM table with the following activity:

Per day:

1. INSERTS: 1,000
2. UPDATES: 200
3. SELECTS: 1 million

Since this is a predominantly SELECTed table, I went for MYISAM. My my.cnf variables were as follows on a 4GB RAM system, Red Hat Enterprise with dual SCSI RAID:

[mysqld]
long_query_time=5

skip-bdb
skip-innodb
skip-locking

query_cache_limit=1M
query_cache_size=32M
query_cache_type=1
max_connections=1500
interactive_timeout=25
wait_timeout=10
connect_timeout=5
thread_cache_size=60
key_buffer=750M
join_buffer=10M

max_heap_table_size=50M
tmp_table_size=64M

max_allowed_packet=16M
table_cache=1800
record_buffer=10M
sort_buffer_size=24M
read_buffer_size=9M
max_connect_errors=10
thread_concurrency=4
myisam_sort_buffer_size=950M
character-set-server=utf8
default-collation=utf8_unicode_ci
set-variable=max_connections=1500
log_slow_queries=/var/log/mysql-slow.log
sql-mode=TRADITIONAL
concurrent_insert=2
low_priority_updates=1

The problem was that at about 3pm GMT the SELECTs from this table would take about 7-8 seconds each on a very simple query such as this:

SELECT column2, column3 FROM table1 WHERE column1 = id;

The index is on column1. So when I would “REPAIR TABLE table1 QUICK” at about 4pm, the above query would execute in 0.00 seconds. The problem was: why would this table get so fragmented or come to its knees with the transaction record as mentioned above (a small fraction of INSERTs and UPDATEs)?

We explored a bunch of issues including questioning our hardware and our system administrators 🙂 When we switched to PostgreSQL, there was no such issue. We do a “VACUUM” every *month* or so and we’re fine.

MySQL, I have come to realize, is as good as a file system on steroids and nothing more. InnoDB doesn’t cut it for me if the backup and all of that is so very cumbersome (mysqlhotcopy is not available, for instance) and eking performance out of an InnoDB table for raw SELECT speed will take a committee of ten PhDs in RDBMS management. Not kosher.

The above example is based on one very simple website. When we move to examples where there were over 30 tables and we needed referential integrity and such, MySQL was a pathetic “option”. Yes, 5.x has included triggers, stored procedures, and such, but they’re a joke. A simple AFTER INSERT trigger takes about 7 seconds. What gives? Integrity checks don’t work – try making a check on a column NOT NULL to include NOT EMPTY (i.e., no blank space can be entered, which as you know, is different from NULL). We will have to do this check in the application.

Just my experience. I know some big websites are using MySQL, but we had neither the budget to throw all that staff, or time, at it. PostgreSQL solved it for us.

Sorry for mentioning this on a mysql performance blog. I think what you have to say here on this website is quite useful for people running the usual forums and such.

Erick

I may add that this one table had 3 million rows, and growing pretty slowly given the insert rate.