June 9, 2006

Why MySQL could be slow with large tables ?

Posted by peter

If you've been reading enough database related forums, mailing lists or blogs you probably heard complains about MySQL being unable to handle more than 1.000.000 (or select any other number) rows by some of the users. On other hand it is well known with customers like Google, Yahoo, LiveJournal,Technocarati MySQL has installations with many billions of rows and delivers great performance. What could be the reason ?

The reason is normally table design and understanding inner works of MySQL. If you design your data wisely considering what MySQL can do and what it can't you will get great perfomance if not, you might become upset and become one of thouse bloggers. Note - any database management system is different in some respect and what works well for Oracle,MS SQL, PostgreSQL may not work well for MySQL and other way around. Even storage engines have very important differences which can affect performance dramatically.

The three main issues you should be concerned if you're dealing with very large data sets are Buffers, Indexes and Joins.

Buffers
First thing you need to take into account is the fact - situation when data fits in memory and when it does not are very different. If you started from in-memory data size and expect gradual performance decrease as database size grows you may be surprised by serve drop in performance. This especially apples to index lookus and joins which we cover later. As everything usually slows down a lot once it does not fit in memory the good solution is to make sure your data fits in memory as good as possible. This could be done by data partitioning (ie old and rarely accessed data stored in different servers), multi-server partitioning to use combined memory and a lot of other technics which I should cover at some later time.

So you understand how much having data in memory changed things here is small example with numbers. If you have your data fully in memory you could perform over 300.000 of random lookups per second from single thread depending on system and table structure. Now if you data fully on disk (both data and index) you would need 2+ IOs to retrieve the row which means you get about 100 rows/sec. Note multiple drives do not really help a lot as we're speaking about single thread/query here. So difference is 3.000 times! It might be a bit too much as there are few completely uncached workloads but 100+ times difference is quite frequent.

Indexes
What everyone knows about indexes is the fact they are good to speed up accesses to database. Some people would also remember if indexes are helpful or not depends on index selectivity - how large proportion of rows matches to particular index value or range. What is often forgotten about is - depending if workload is cached or not different selectivity might show benefit from using indexes. In fact even MySQL optimizer currently does not take it into account. For In memory workload index accesses might be faster even if 50% of rows are accessed, while for disk IO bound accessess we might be better of doing full table scan even if only few percent or rows are accessed.

Lets do some computations again. Consider table which has 100 byte rows. With decent SCSI drive we can get 100MB/sec read speed which gives us about 1.000.000 rows per second for fully sequential access, jam packed rows - quite possible scenario for MyISAM tables. Now if we take the same hard drive for fully IO bound workload it will be able to provide just 100 row lookups by index pr second. The difference is 10.000 times for our worse case scenario. It might be not that bad in practice but again it is not hard to reach 100 times difference.

Here is little illustration I've created the table with over 30 millions of rows. "val" column in this table has 10000 distinct value, so range 1..100 selects about 1% of the table. The times for full table scan vs range scan by index:

SQL:
  1. mysql> SELECT count(pad) FROM large;
  2. +------------+
  3. | count(pad) |
  4. +------------+
  5. |   31457280 |
  6. +------------+
  7. 1 row IN SET (4 min 58.63 sec)
  8.  
  9. mysql> SELECT count(pad) FROM large WHERE val BETWEEN  1 AND 100;
  10. +------------+
  11. | count(pad) |
  12. +------------+
  13. |     314008 |
  14. +------------+
  15. 1 row IN SET (29 min 53.01 sec)

Also remember - not all indexes are created equal. Some indexes may be placed in sorted way or pages placed in random places - this may affect index scan/range scan speed dramatically. The rows referenced by indexes also could be located sequentially or require radom IO if index ranges are scanned. There are also clustered keys in Innodb which combine index access with data access, saving you IO for completely disk bound workloads.

There are certain optimizations in works which would improve performance of index accesses/index scans. For example retrieving index values first and then accessing rows in sorted order can be a lot of help for big scans. This will reduce the gap but I doubt it will be closed.

Joins
Joins are used to compose the complex object which was previously normalized to several tables or perform complex queries finding relationships between objects. Normalized structure and a lot of joins is right way to design your database as textbooks teach you... but when dealing with large data sets it could be recepie to disaster. The problem is not the data size - normalized data normally becomes smaller, but dramatically increased number of index lookups which could be random accesses. This problem exists for all kinds of applications, however for OLTP applications with queries examining only few rows it is less of the problem. Data retrieval, search, DSS, business intelligence applications which need to analyze a lot of rows run aggregates etc is when this problem is the most dramatic.

Some joins are also better than others. For example if you have star join with dimention tables being small it would not slow things down too much. On other hand join of few large tables, which is completely disk bound can be very slow.

One of the reasons elevating this problem in MySQL is lack of advanced join methods at this point (the work is on a way) - MySQL can't do hash join or sort merge join - it only can do nested loops method which requires a lot of index lookups which may be random.

Here is good example. As we saw my 30mil rows (12GB) table was scanned in less than 5 minutes. Now if we would do eq join of the table to other 30mil rows table and it will be completely random. We'll need to perform 30 millions of random row reads, which gives us 300.000 seconds with 100 rows/sec rate. So we would go from 5 minutes to almost 4 days if we need to do the join. Some people assume join would be close to two full table scans (as 60mil of rows need to be read) - this is way wrong.

Do not take me as going against normalization or joins. It is great principle and should be used when possible. Just do not forget about performance implications designing the system and do not expect joins to be be free.

Finally I should mention one more MySQL limitation which requires you to be extra careful working with large data sets. In MySQL single query runs as single thread (with exeption of MySQL Cluster) and MySQL issues IO requests one by one for query execution, which means if single query execution time is your concern many hard drives and large number of CPUs will not help. Sometimes it is good idea to manually split query into several, run in parallel and aggregate result sets.

So if you're dealing with large data sets and complex queries here are few tips

Try to fit data set you're working with in memory - Processing in memory is so much faster and you have whole bunch of problems solved just doing so. Use multiple servers to host portions of data set. Store portion of data you're going to work with in temporary table etc.

Prefer full table scans to index accesses - For large data sets full table scans are often faster than range scans and other types of index lookups. Even if you look at 1% or rows or less full table scan may be faster.

Avoid joins to large tables Joining of large data sets using nested loops is very expensive. Try to avoid it. Joins to smaller tables is OK but you might want to preload them to memory before join so there is no random IO needed to populate the caches.

With proper application architecture and table design you can build applications operating with very large data sets based on MySQL.

Related posts: :MySQL Replication and Slow Queries::Microslow patch for 5.1.20::Does Slow query log logs all slow queries ?:
 

72 Comments »

  1. Good post Peter..

    just a couple of questions to clarify somethings.

    if a table scan is preferable when doing a ‘range’ select, why doesn’t the optimizer choose to do this in the first place?

    is there some sort of rule of thumb here.. use a index when you expect your queries to only return X% of data back? otherwise put a hint in your SQL to force a table scan ?

    what changes are in 5.1 which change how the optimzer parses queries..

    does running optimize table regularly help in these situtations? or just when you have a large change in your data distribution in your table?

    oh.. one tip for your readers.. always run ‘explain’ on a fully loaded database to make sure your indexes are being used

    Comment :: June 9, 2006 @ 5:14 pm

  2. 2. peter

    Ian,

    Right. In theory optimizer should know and select it automatically. In fact it is not smart enough. It also simply does not have the data available - is given index (range) currently in memory or will it need to read it from the disk ? How much index is fragmented ? How random accesses would be to retrieve the rows. Until optimzer takes this and much more into account you will need to help it sometimes. I did not mentioned it in the article but there is IGNORE INDEX() hint to force full table scan.

    There is no rule of thumb. As you could see in the article in the test I’ve created range covering 1% of table was 6 times slower than full table scan… which means at about 0.2% table scan is preferable. In other cases especially for cached workload it can be as much as 30-50%.

    In MySQL 5.1 there are tons of little changes. What change you’re speaking about ?

    OPTIMIZE helps for certain problems - ie it sorts indexes themselves and removers row fragmentation (all for MYISAM tables). It however can’t make row retrieval which is done by index sequential one. One more hint if you have all your ranges by specific key ALTER TABLE … ORDER BY key would help a lot.

    Yes. Runing explain is good idea. Also do not forget to try it out for different constants - plans are not always the same.

    Comment :: June 9, 2006 @ 11:59 pm

  3. Ian,
    as I wrote in http://www.mysqlperformanceblog.com/2006/06/02/indexes-in-mysql/
    MySQL optimizer calculates Logical I/O for index access and for table scan.
    One big mistake here, I think, MySQL makes assumption 100 key comparison
    like ” if (searched_key == current_key)” is equal to 1 Logical I/O.
    But I believe on modern boxes constant 100 should be much bigger.

    Comment :: June 10, 2006 @ 2:41 am

  4. 4. Alexey

    For most workloads you’ll always want to provide enough memory to key cache so its hit ratio is like 99.9%. Even if a table scan looks faster than index access on a cold-cache benchmark, it doesn’t mean that it’s a good idea to use table scans. It can easily hurt overall system performance - by trashing OS disk cache, and if we compare table scan on data cached by OS and index scan on keys cached by MySQL, table scan uses more CPU (because of syscall overhead and possible context switches due to syscalls).
    Just an opinion. :)

    Comment :: June 10, 2006 @ 7:07 am

  5. [...] Why MySQL could be slow with large tables ?: “If you’ve been reading enough database related forums, mailing lists or blogs you probably heard complains about MySQL being unable to handle more than 1.000.000 (or select any other number) rows by some of the users. On other hand it is well known with customers like Google, Yahoo, LiveJournal,Technocarati MySQL has installations with many billions of rows and delivers great performance. What could be the reason ? [...]

    Pingback :: June 10, 2006 @ 7:25 pm

  6. 6. peter

    Alexey,

    You probably missunderstood this article. I’m writing about working with large data sets, these are then your tables and your working set do not fit in memory. This is the case then full table scan will actually require less IO than using indexes. And this is when you can’t get 99.99% keycache hit rate. Not to mention keycache rate is only part of the problem - you also need to read rows which might be much larger and so not so well cached.

    As you probably seen from the article my first advice is to try to get your data to fit in cache. If it is possible you instantly will have half of the problems solved. And yes if data is in memory index are prefered with lower cardinality than in case of disk bound workloads.

    Comment :: June 14, 2006 @ 7:41 am

  7. We’ve got 20,000,000 bank loan records we query against all sorts of tables. Everything is real real slow. We’re using LAMP. I could send the table structures and queries/ php cocde that tends to bog down.

    Comment :: June 16, 2006 @ 3:44 pm

  8. 8. peter

    Hi Aaron,

    Please feel free to send it to me to pz at mysql performance blog.com. Just do not forget EXPLAIN for your queries and if you have php to load it up with some “random” data which is silimar to yours that would be great.

    We should take a look at your queries to see what could be done.

    Comment :: June 17, 2006 @ 6:14 am

  9. MySQL performance debate…

    I must admit after SPEC jAppServer2004™ results with MySQL , my interest rose about MySQL being a good candidate for high performance database application. If Google Adsense , Flick’r , Technorati and many others use it for years, can it be…

    Trackback :: June 18, 2006 @ 8:17 am

  10. 10. Paul

    I implemented a simple logging of all my web sites access to make some statistics (sites access per day, ip address, search engine source, search queries, user text entries, …) but most of my queries went way too slow to be of any use last year. The problem started when I got to around 600,000 rows (table size: 290MB).

    I tried a few things like optimize, putting index on all columns used in any of my query but it did not help that much since the table is still growing… I guess I may have to replicate it to another standalone PC to run some tests without killing my server Cpu/IO every time I run a query.

    I am not using any join, I will try the ‘explain’ and the ‘IGNORE INDEX()’ when I have a chance although I don’t think it will help since I added indexes after I saw the problem.

    The index does make it very fast for one of my table on another project (list of all cities in the world: 3 million rows). I guess it’s all about memory vs hard disk access.

    Now my question is for a current project that I am developing. You can think of it as a webmail service like google mail, yahoo or hotmail.

    Right now I am wondering if it would be faster to have one table per user for messages instead of one big table with all the messages and two indexes (sender id, recipient id). This could mean millions of table so it is not easy to test.

    I will probably write a random users/messages generator to create a million user with a thousand message each to test it but you may have already some information on this so it may save me a few days of guess work.

    Any suggestions would be appreciated.

    Comment :: June 19, 2006 @ 1:46 pm

  11. 11. peter

    Paul,

    300MB table is tiny. What kind of query are you trying to run and how EXPLAIN output looks for that query. Also what is your MySQL Version ?

    There are some other tricks which you need to consider - for example if you do GROUP BY and number of resulting rows is large you might get pretty poor speed because temporary table is used and it grows large. Using SQL_BIG_RESULT helps to make it use sort instead.

    Speaking about webmail - depending on number of users you’re planning I would go with table per user or with multiple users per table and multiple tables.

    Besides having your tables more managable you would get your data clustered by message owner, which will speed up opertions a lot. Also this means once user logs in and views messages they will be cached in OS cache or MySQL buffers speeding up further work dramatically.

    Comment :: June 19, 2006 @ 2:06 pm

  12. 12. Apachez

    8. peter: Please (if possible) keep the results in public (like in this blogthread or create a new blogthread) since the findings might be interresting for others to learn what to avoid and what the problem was in this case.

    11. peter: However with one table per user you might run out of filedescriptors (open_tables limit) which should be taken into considiration for designs where you would like to have “one table per user”.

    Comment :: June 20, 2006 @ 8:26 am

  13. 13. peter

    Apachez - yes this is the point.

    You get free answers to your questions by asking them in this blog (or at MySQL Forums) but other people can benefit from the answers as well.

    Speaking about “table per user” - it does not mean you will run out of file descriptors. “table_cache” is what defines how many tables will be opened and you can configure it independently of number of tables you’re using. You however want to keep value hight in such configuration to avoid constant table reopens. With Innodb tables you also have all tables kept open permanently which can waste a lot of memory but it is other problem. Speaking about “open_file_limit” which limits number of files MySQL can use at the same time - on modern operation systems it is safe to set it to rather high values. I used MySQL with other 100.000 of files opened at the same time with no problems. Shutdown can be long in such case though.

    Comment :: June 21, 2006 @ 5:41 am

  14. I have a table with 35 mil records. When I wanted to add a column (alter table) I would take about 2 days. Writing my own program in
    c# that prepared a file for import shortened this task to about 4 hours.

    Comment :: August 14, 2006 @ 2:09 pm

  15. Hm. I’m actually quite surprised. ALTER TABLE normally rebuilds indexes by sort, so does LOAD DATA INFILE (Assuming we’re speaking about MyISAM table) so such difference is quite unexpected. Might be for some reason ALTER TABLE was doing index rebuild by keycache in your tests, this would explain it.

    ALTER TABLE and LOAD DATA INFILE should nowever look on the same settings to decide which method to use.

    Comment :: August 14, 2006 @ 2:15 pm

  16. 16. Daniel

    I’m just dealing with the same issue with a message system.

    At the moment I have one table (myisam/mysql4.1) for users inbox and one for all users sent items.
    Now the inbox table holds about 1 million row with nearly 1 gigabyte total. The
    sent items is the half.
    At this point it is working well with over 700 concurrent user.
    Now I’m doing a recode and there should be a lot more functions like own folders etc. and the queries will be a lot more complex.
    That’s why I’m now thinking about useful possibilities of designing the message table and about whats the best solution for the future.

    Is it really useful to have an own message table for every user?

    Thanks a lot for any suggestions.

    System: It’s now on a 2xDualcore Opteron with 4GB Ram/Debian/Apache2/MySQL4.1/PHP4/SATA Raid1)
    In near future I will have the Apache on a dedicated machine and the Mysql Server too (and the next step will be a Master/Slave server setup for the database).

    Comment :: August 20, 2006 @ 3:06 am

  17. Daniel,

    I would surely go with multiple tables. If it should be table per user or not depends on numer of users. For 1000 users that would work but for 100.000 it would be too many tables.

    I would have many to many mapping from users to tables so you can decide how many users you put per table later and I would also use composite primary keys if you’re using Innodb tables so data is clustered by user.

    Comment :: August 20, 2006 @ 5:28 am

  18. 18. Sergey

    Peter,

    I think you can give me some advise. I am running MySQL 4.1 on RedHat Linux. The box has 2GB of RAM, it has dual 2.8GHz Xeon processors, and /etc/my.cnf file looks like this.


    [mysqld]

    key_buffer = 512M
    max_allowed_packet = 8M
    table_cache = 512
    sort_buffer_size = 32M
    read_buffer_size = 32M
    read_rnd_buffer_size = 128M
    myisam_sort_buffer_size = 256M
    thread_cache = 32
    query_cache_size = 256M

    I am running data mining process that updates/inserts rows to the table (i.e. I do multifield select on indexed fields, and if row is found, I update the data, if not I insert new row). Up to about 15,000,000 rows (1.4GB of data) the procedure was quite fast (500-1000 rows per second), and then it started to slow down. Now I have about 75,000,000 rows (7GB of data) and I am getting about 30-40 rows per second.

    Comment :: August 21, 2006 @ 1:27 pm

  19. Sergey, Would you mind posting your case on our forums instead at
    http://forum.mysqlperformanceblog.com and I’ll reply where. This way more users will benefit from your question and my reply.

    Comment :: August 21, 2006 @ 3:17 pm

  20. 20. Sergey

    http://forum.mysqlperformanceblog.com/s/t/17/

    Comment :: August 21, 2006 @ 4:21 pm

  21. 21. D.ZHU

    Hi peter,

    I’m doing a coding project that would result in massive amounts of data (will reach somewhere like 9billion rows within 1 year). supposing i’m completely optimized…

    * If i run a ’select from where…’ query, how long is the query likely to take?
    * also how long would an insert take?

    I see you have in the example above, 30 millions of rows of data and a select took 29mins!

    * and how would i estimate such performance figures?

    Comment :: November 7, 2006 @ 8:05 pm

  22. D Zhu,

    Answer depends on selectivity at large extent as well as if where clause is matched by index or full scan is performed.

    You also need to consider how wide are rows - dealing with 10 byte rows is much faster than 1000 byte rows.

    Regarding how to estimate I would do some benchmarks and match them against what you would expect and what you’re looking for. There are many design and configuration alternatives to deliver you what you’re looking for

    Comment :: November 12, 2006 @ 8:29 pm

  23. Normally MySQL is rather fast loading data in MyISAM table, but there is exception, which is when it can’t rebuild indexes by sort but builds them
    row by row instead. It can be happening due to wrong configuration (ie too small myisam_max_sort_file_size or myisam_max_extra_sort_file_size) or
    it could be just lack of optimization, if you’re having large (does not fit in memory) PRIMARY or UNIQUE indexes.

    Below is the internal letter I’ve sent out on this subject which I guessed would be good to share

    Today on my play box I tried to load data into MyISAM table (which was
    previously dumped as mysqldump –tab)

    The data was some 1.3G, 15.000.000 rows, 512MB memory one the box.
    MySQL 4.1.8

    One could could call it trivial fast task, unfortunately I had
    unique key on varchar(128) as part of the schema.

    The load took some 3 hours before I aborted it finding out it was just
    about 20% done.

    The problem is - unique keys are always rebuilt using key_cache, which
    means we’re down to some 100-200 rows/sec as soon as index becomes
    significantly larger than memory.

    You can’t go away with ALTER TABLE DISABLE KEYS as it does not affect
    unique keys.

    Comment :: November 29, 2006 @ 12:23 am

  24. Labus,

    Yes that is the problem. As I mentioned sometime if you want to have quick build of unique/primary key you need to do ugly hack - create table without the index, load data, replace the .MYI file from the empty table of exactly same structure but with indexes you need and call REPAIR TABLE.

    REPAIR TABLE builds all indexes by sort.

    Comment :: November 29, 2006 @ 2:39 am

  25. 25. Len

    Google may use Mysql but they don’t necessarily have billions of rows - just because google uses MySQL doesn’t mean they actually use it for their search engine results. They have many little sections in their website you know. Laughably they even used PHP for one project. Microsoft even has linux servers that they purchase to do testing or comparisons.

    Comment :: January 26, 2007 @ 2:55 pm

  26. The rumors are Google is using MySQL for Adsense. Yahoo uses MySQL for about anything, of course not full text searching itself as it just does not map well to relational database.

    Comment :: January 27, 2007 @ 12:47 am

  27. @Len: not quite sure what you’re getting at…other than being obtuse.

    If you’d like to know how and what Google uses MySQL for (yes, AdSense, among other things), come to the Users Conference in April (http://mysqlconf.com). 4 Googlers are speaking there, as is Peter. Also, I don’t understand your aversion to PHP… what about using PHP is laughable? Sounds to me you are just flame-baiting…

    Comment :: February 21, 2007 @ 6:06 pm

  28. I have made an online dictionary using a MySQL query I found online. It has been working pretty well until today. Now the page loads quite slowly. I wonder how I can optimize my table. The things you wrote here are kind of difficult for me to follow.
    My table has 545 rows, id column, dutch words column, Thai characters, phonetics, is dynamic, has 59 bytes per row, auto-indexes, has a collation: utf8 unicode, data: 25,820 bytes index: 6,144 bytes and total: 31,964 bytes of data. Should I split up the data to load iit faster or use a different structure? Thanks for your suggestions.

    Comment :: February 24, 2007 @ 11:38 pm

  29. Jasper,

    Your table is not large by any means. I’d suggest you to find which query in particular got slow and post it on forums.

    Comment :: February 25, 2007 @ 4:11 am

  30. Hello peter,

    I have a table with a unique key on two columns (STRING, URL). The table contains 36 million rows (Data size 5GB, Index size 4GB).

    I run the following query, which takes 93 seconds !

    SELECT TITLE FROM GRID WHERE STRING = ’sport’;

    When I run the query below, it only takes 0.1 seconds :

    SELECT COUNT(*) FROM GRID WHERE STRING = ’sport’;

    So while the where-clause is the same, the first query takes much more time.
    The slow part of the query is thus the retrieving of the data. I came to this
    conclusion also because the query took longer the more rows were retrieved.

    Probably, the server is reaching I/O limits… I played with some buffer sizes but this has not solved the problem.. Has anyone experience with table size this large ? Is MySQL able to handle tables (MyIsam) this large ?

    Also, is it an option to split this big table in 10 smaller tables ?

    CREATE TABLE `GRID` (
    `ID` bigint(20) NOT NULL auto_increment,
    `STRING` varchar(100) character set utf8 collate utf8_unicode_ci NOT NULL default ”,
    `URL` varchar(230) character set utf8 collate utf8_unicode_ci NOT NULL default ”,
    `LANGUAGE` char(2) NOT NULL default ‘EN’,
    `COUNTRY` char(2) NOT NULL,
    `TITLE` varchar(255) character set utf8 collate utf8_unicode_ci NOT NULL default ”,
    `DESCRIPTION` text character set utf8 collate utf8_unicode_ci,
    `POINTS` decimal(10,2) NOT NULL default ‘0.00′,
    PRIMARY KEY (`ID`),
    UNIQUE KEY `string` (`STRING`,`URL`)

    Comment :: March 12, 2007 @ 10:31 am

  31. Why are you surprised ? COUNT(*) query is index covered so it is expected to be much faster as it only touches index and does sequential scan.

    Comment :: March 12, 2007 @ 11:50 am

  32. 32. Nick

    Peter,
    I have similar situation to the message system, only mine data set would be even bigger. I have several data sets and each of them would be around 90,000,000 records, but each record has just a pair of IDs as compository primary key and a text, just 3 fields.
    Number of IDs would be between 15,000 ~ 30,000 depends of which data set. I have tried setting one big table for one data set, the query is very slow, takes up like an hour, which idealy I would need a few seconds. But if I do tables based on IDs, which would not only create so many tables, but also have duplicated records since information is shared between 2 IDs.
    What would be the best way to do it?

    Comment :: March 13, 2007 @ 10:18 am

  33. 33. Jacob

    Hello,

    I am building a statistics app that will house 9-12 billion rows. I am trying to use Mysql Clustering, to the ndbcluster engine. Hardware is not an issue, that is to say I can get whatever hardware I need to do the job. Select times are reasonable, but insert times are very very very slow. As an example, in a basic config using MyISM tables I am able to insert 1million rows in about 1-2 min. The Hardware servers I am testing on are 2.4G Xeon CPU with a 1GB RAM and a Gig network. However, with ndbcluster the exact same inserts are taking more than 15 min. Anyone have any ideas on how I can make this faster?

    Comment :: April 3, 2007 @ 3:37 pm

  34. 34. Martin

    Im working on a project which will need some tables with about 200-300 million rows.
    Each row consists of 2x 64 bit integers.
    The table structure is as follows:
    CREATE TABLE `z_chains_999` (
    `startingpoint` bigint(8) unsigned NOT NULL,
    `endingpoint` bigint(8) unsigned NOT NULL,
    PRIMARY KEY (`startingpoint`,`endingpoint`)
    ) ENGINE=MyISAM DEFAULT CHARSET=utf8 ROW_FORMAT=FIXED;

    My problem is, as more rows are inserted, the longer time it takes to insert more rows.
    I insert rows in batches of 1.000.000 rows.
    First insertion takes 10 seconds, next takes 13 seconds, 15, 18, 20, 23, 25, 27 etc. I fear when it comes up to 200 million rows.
    Removing the PRIMARY KEY stops this problem, but i NEED IT.. Any suggestions what to do?
    my key_buffer is set to 1000M, but this problem already begins long before the memory is full.

    Comment :: April 6, 2007 @ 2:20 am

  35. How large is index when it becomes slower. Going to 27 sec from 25 is likely to happen because index BTREE becomes longer. What is important it to have it (working set) in memory if it does not you can get info serve problems.

    What queries are you going to run on it ? May be merge tables or partitioning will help

    Comment :: April 6, 2007 @ 12:17 pm

  36. 36. Martin

    It gets slower and slower for every 1 million rows i insert. The first 1 million row takes 10 seconds to insert, after 30 million rows, it takes 90 seconds to insert 1 million rows more.
    I need to do 2 queries on the table. 1st one (which is used the most) is “SELECT COUNT(*) FROM z_chains_999″, the second, which should only be used a few times is “SELECT * FROM z_chains_999 ORDER BY endingpoint ASC”

    I found that setting delay_key_write to 1 on the table stops this from happening. Now it remains on a steady 12 seconds every time i insert 1 million rows. What exactly is it this option does?
    After 26 million rows with this option on, it suddenly takes 520 seconds to insert the next 1 million rows.. Any idea why? Maybe the memory is full?

    Comment :: April 6, 2007 @ 3:36 pm

  37. Peter, I just stumbled upon your blog by accident. Very good info!

    I’m building an evaluation system with about 10-12 normalized tables. There are two main output tables that most of the querying will be done on. I filled the tables with 200,000 records and my query won’t even run. It’s losing connection to the db server. I tried SQL_BIG_RESULT, analyze table, etc… nothing seems to help

    Here’s my query. It does lot of counting. Does this look like a performance nightmare waiting to happen?

    Thanks,
    Will

    SELECT
    Q.questionID,
    Q.questionsetID,
    Q.question,
    Q.questioncatid,
    QAX.questionid,
    QAX.answersetid,
    ASets.answersetid,
    ASets.answersetname,
    ASAX.answersetid,
    ASAX.answerid,
    A.answerID,
    A.answername,
    A.answervalue,
    COUNT(DISTINCT e3.evalanswerID) AS totalforthisquestion,
    COUNT(DISTINCT e1.evalanswerID) AS totalforinstructor,
    (COUNT(DISTINCT e3.evalanswerID)/COUNT(DISTINCT e1.evalanswerID)*100)
    AS answerpercentage
    FROM tblquestions Q
    INNER JOIN tblquestionsanswers_x QAX USING (questionid)
    INNER JOIN tblanswersets ASets USING (answersetid)
    INNER JOIN tblanswersetsanswers_x ASAX USING (answersetid)
    INNER JOIN tblanswers A USING (answerid)
    LEFT JOIN (tblevalanswerresults e1 INNER JOIN tblevaluations e2 ON
    e1.evalid = e2.evalid
    AND e2.InstructorID = ‘1021338′

    ) ON e1.questionid = Q.questionID
    LEFT JOIN (tblevalanswerresults e3 INNER JOIN tblevaluations e4 ON
    e3.evalid = e4.evalid
    AND e4.InstructorID = ‘1021338′

    ) ON e3.questionid = Q.questionID AND
    e3.answerID = A.answerID

    GROUP BY
    Q.questionID,
    Q.questionsetID,
    Q.question,
    ASets.answersetid,
    Q.questioncatid,
    A.answerID,
    A.answername,
    A.answervalue
    HAVING Q.questioncatid = 1

    UNION
    /**The following query is just for the totals, and does not include the
    group columns**/
    SELECT
    9999,
    NULL,
    ‘Totals’,
    Q.questionID,
    Q.questionsetID,
    Q.question,
    Q.questioncatid,
    QAX.questionid,
    ASets.answersetid,
    ASets.answersetname,
    A.answerID,
    A.answername,
    A.answervalue,
    COUNT(DISTINCT e3.evalanswerID) AS totalforthisquestion,
    COUNT(DISTINCT e1.evalanswerID) AS totalforinstructor,
    (COUNT(DISTINCT e3.evalanswerID)/COUNT(DISTINCT e1.evalanswerID)*100)
    AS answerpercentage
    FROM tblquestions Q
    INNER JOIN tblquestionsanswers_x QAX USING (questionid)
    INNER JOIN tblanswersets ASets USING (answersetid)
    INNER JOIN tblanswersetsanswers_x ASAX USING (answersetid)
    INNER JOIN tblanswers A USING (answerid)
    LEFT JOIN (tblevalanswerresults e1 INNER JOIN tblevaluations e2 ON
    e1.evalid = e2.evalid
    AND e2.InstructorID = ‘1021338′

    ) ON e1.questionid = Q.questionID
    LEFT JOIN (tblevalanswerresults e3 INNER JOIN tblevaluations e4 ON
    e3.evalid = e4.evalid
    AND e4.InstructorID = ‘1021338′

    ) ON e3.questionid = Q.questionID AND
    e3.answerID = A.answerID

    GROUP BY Q.questioncatid, ASets.answersetname,A.answerID,A.answername,A.answervalue

    HAVING Q.questioncatid = 1

    Comment :: April 12, 2007 @ 7:22 pm

  38. 38. Adesh

    I am having one query

    SELECT DISTINCT spp.provider_profile_id, sp.provider_id, sp.business_name, spp.business_phone, spp.business_address1, spp.business_address2, spp.city, spp.region_id, spp.state_id, spp.rank_number, spp.zipcode, sp.sic1, sp.approved
    FROM `service_provider` sp
    INNER JOIN `service_provider_profile` spp ON sp.provider_id = spp.provider_id
    WHERE sp.approved = ‘Y’
    AND spp.master_status = ‘0′
    ORDER BY sp.business_name ASC
    LIMIT 0 , 100

    In all three tables there are more than 7 lakh record. My query doesn’t work at all
    Any solution……….?
    I m using php 5 and MySQL 4.1………

    Comment :: April 24, 2007 @ 3:51 am

  39. Hi, Im working proffesionally with postgresql and mssql and at home im using mysql for my leasure projects .. Currently Im working on a project with about 150.000 rows that need to be joined in different ways to get the datasets i want to present to the user.

    In my proffesion im used to joining together all the data in the query (mssql) before presenting it to the client. Is this wise .. i.e.

    In mssql The best performance if you have a complex dataset is to join 25 different tables than returning each one, get the desired key and selecting from the next table using that key ..

    But as I understand in mysql it’s best not to join to much .. Is this correct ..

    Comment :: May 18, 2007 @ 9:56 am

  40. 40. Fernando

    Hello Guys
    This article puzzles a bit. I have a project I have to implement with open-source software. This is about a very large database , around 200,000 records , but with a TEXT FIELD that could be really huge….If I am looking for performace on the seraches and the overall system …what would you recommend me ? (In terms of Software and hardware configuration)

    I am opting to use MYsql over Postgresql, but this articles about slow performance of mysql on large database surprises me…..

    By the way….on the other hard, Does Mysql support XML fields ?

    Comment :: May 23, 2007 @ 10:41 am

  41. This article is not about MySQL being slow at large tables. We have applications with many billions of rows and Terabytes of data in MySQL.

    This article is about typical mistakes people are doing to get their MySQL running slow with large tables.

    MySQL can store XML in Text field.

    Comment :: May 23, 2007 @ 10:51 am

  42. 42. andreas

    Hi again, Indeed, this article is about common misconfgigurations that people make .. including me .. Im used to ms sql server which out of the box is extremely fast .. But for my mysql server I’m having performance issues, så my question remains, what is the best route, join and complex queries, or several simple queries. What im asking for is what mysql does best, lookup and indexes och returning data.

    Comment :: May 23, 2007 @ 11:09 am

  43. Andreas,

    You can’t answer this question that easy. Depending on type of joins they may be slow in MySQL or may work well. A lot of simple queries generally works well but you should not abuse it.

    In general you need to spend some time experimenting with your particular tasks - basing DBMS choice on rumors you’ve read somewhere is bad idea.

    Comment :: May 23, 2007 @ 12:38 pm

  44. 44. Andreas Thörn

    Hi. I quess I have to experiment a bit, Does anyone have any good newbie tutorial configuring MySql .. My server isn’t the fastest in the world, so I was hoping to enhance performance by tweaking some parameters in the conf file, but as everybody know, tweaking without any clue how different parameters work together isn’t a good idea ..

    Comment :: May 23, 2007 @ 10:18 pm

  45. 45. Jim

    Hi, I have a table I am trying to query with 300K records which is not large relatively speaking. My problem is some of my queries take up to 5 minutes and I can’t seem to put my finger on the problem. I have tried indexes and that doesn’t seem to be the problem. My query is based on keywords. I get the keyword string then look up the id. I then use the id of the keyword to lookup the id of my record. I then build a SELECT query. My SELECT statement looks something like
    SELECT * FROM table_name WHERE (year > 2001) AND (id = 345 OR id = 654 ….. OR id = 90)
    The second set of parenthesis could have 20k+ conditions. Is there another way to approach this?
    I am running MYSQL 5.0

    Comment :: May 24, 2007 @ 9:43 am

  46. Take the * out of your select, and name the columns you need. That should improve it somewhat.

    Will

    Comment :: May 24, 2007 @ 9:50 am

  47. 47. Jim

    Will,
    I’m not using an * in my actual statement
    my actual statement looks more like
    SELECT id FROM table_name WHERE (year > 2001) AND (id = 345 OR id = 654 ….. OR id = 90)

    Even the count(*) takes over 5 minutes on some queries.

    Comment :: May 24, 2007 @ 9:54 am

  48. 48. Jim

    To answer my own question I seemed to find a solution.
    I used the IN clause and it sped my query up considerably.
    To use my example from above

    SELECT id FROM table_name WHERE (year > 2001) AND (id IN( 345,654,…, 90))

    Comment :: May 25, 2007 @ 5:54 am

  49. 49. Erick

    This article is BS. MySQL sucks on big databases, period. The big sites such as Slashdot and so forth have to use massive clusters and replication. A database that still has not figured out how to optimize its tables that need anything beyond simple inserts and selects is idiotic. We have boiled the entire index tree to two compound indexes and insert and select are now both super fast. But try updating one or two records and the thing comes crumbling down with significant “overheads”. InnoDB is suggested as an alternative. The select speed on InnoDB is painful and requires huge hardware and memory to be meaningful. Basically: we’ve moved to PostgreSQL, which is a real database and with version 8.x is fantastic with speed as well.

    Comment :: June 13, 2007 @ 3:13 am

  50. 50. Eric

    Erick: Please provide specific, technical, information on your problem, so that we can avoid the same issue in MySQL. For example, how large were your MySQL tables, system specs, how slow were your queries, what were the results of your explains, etc. Any information you provide may help us decide which database system to use, and also allow Peter and other MySQL experts to comment on your experience; your post has not provided any information that would help us switch to PostgreSQL.
    Eric

    Comment :: June 20, 2007 @ 2:13 pm

  51. 51. Erick

    Ok, here are specifics from one system. It has exactly one table. MYISAM table with the following activity:

    Per day:

    1. INSERTS: 1,000
    2. UPDATES: 200
    3. SELECTS: 1 million

    Since this is a predominantly SELECTed table, I went for MYISAM. My my.cnf variables were as follows on a 4GB RAM system, Red Hat Enterprise with dual SCSI RAID:

    [mysqld]
    long_query_time=5

    skip-bdb
    skip-innodb
    skip-locking

    query_cache_limit=1M
    query_cache_size=32M
    query_cache_type=1
    max_connections=1500
    interactive_timeout=25
    wait_timeout=10
    connect_timeout=5
    thread_cache_size=60
    key_buffer=750M
    join_buffer=10M

    max_heap_table_size=50M
    tmp_table_size=64M

    max_allowed_packet=16M
    table_cache=1800
    record_buffer=10M
    sort_buffer_size=24M
    read_buffer_size=9M
    max_connect_errors=10
    thread_concurrency=4
    myisam_sort_buffer_size=950M
    character-set-server=utf8
    default-collation=utf8_unicode_ci
    set-variable=max_connections=1500
    log_slow_queries=/var/log/mysql-slow.log
    sql-mode=TRADITIONAL
    concurrent_insert=2
    low_priority_updates=1

    The problem was that at about 3pm GMT the SELECTs from this table would take about 7-8 seconds each on a very simple query such as this:

    SELECT column2, column3 FROM table1 WHERE column1 = id;

    The index is on column1. So when I would “REPAIR TABLE table1 QUICK” at about 4pm, the above query would execute in 0.00 seconds. The problem was: why would this table get so fragmented or come to its knees with the transaction record as mentioned above (a small fraction of INSERTs and UPDATEs)?

    We explored a bunch of issues including questioning our hardware and our system administrators :) When we switched to PostgreSQL, there was no such issue. We do a “VACCUM” every *month* or so and we’re fine.

    MySQL, I have come to realize, is as good as a file system on steroids and nothing more. InnoDB doesn’t cut it for me if the backup and all of that is so very cumbersome (mysqlhotcopy is not available, for instance) and eking performance out of an InnoDB table for raw SELECT speed will take a committee of ten PhDs in RDBMS management. Not kosher.

    The above example is based on one very simple website. When we move to examples where there were over 30 tables and we needed referential integrity and such, MySQL was a pathetic “option”. Yes 5.x has included triggers, stored procedures, and such, but they’re a joke. A simple AFTER INSERT trigger takes about 7 second. What gives? Integrity checks don’t work — try making a check on a column NOT NULL to include NOT EMPTY (i.e., no blank space can be entered, which as you know, is different from NULL). We will have to do this check in the application.

    Just my experience. I know some big websites are using MySQL, but we had neither the budget to throw all that staff, or time, at it. PostgreSQL solved it for us.

    Sorry for mentioning this on a mysql performance blog. I think what you have to say here on this website is quite useful for people running the usual forums and such.

    Comment :: June 20, 2007 @ 6:14 pm

  52. 52. Erick

    I may add that this one table had 3 million rows, and growing pretty slowly given the insert rate.

    Comment :: June 20, 2007 @ 6:38 pm

  53. [...] building summary tables and tricks of executing queries exactly as you want. I already wrote about development and configuration side of the problem so I will not go to details [...]

    Pingback :: July 5, 2007 @ 11:05 am

  54. I’ve read SHE-DBA’s blog on using MySQL as a data-warehousing platform and where it _can_ be used as such provided the design is good and the queries are optimised.

    I work with MSSQL and multi-GB tables (up to 1 TB) with some very ridiculous normalisation and absurd number of columns (some tables have > 900!!) and the number of rows are also in the couple millions.

    I do data-mining of these data and I’m(we actually) getting reprimanded for sending queries that take a long time to finish. (30min up to 2/3 hours). My queries are complex and involves a quite a few joins (due to the normalisation) and multiple subqueries (due to nature of the data). (This is on a 8x Intel Box w/ mutli GB ram)

    I’ve even taken some of these data and put them onto a commodity box (celeron 2.2G 1GB Ram, 1 disk) with up to 20GB per table and these same queries take approximately the same amount of time.

    Due to the usage of subqueries, I think this may be the main cause of the slowness. Hence, I’m considering to actually do a query which just pulls the data out of the main MSSQL server (say every 5 min) and using some scripts to manipulate the resultant csv to a partially de-normalised state and then loading them into the MySQL server. I suspect that this may make things faster. (At least I don’t have to join so many tables to get the data)

    The other thing which I’m considering is to just pull these data direct into users’ PC, dumping it into a sqlite DB and do the joins there, thereby minimising load on the main server or even the mysql server. (running on 5.0, so I can’t even play with partitioning)

    Do you have any thoughts on this? Seems like we’re going in circles with these.

    Everything on Mainframe. –> Processed data to the users PC
    pull from server(select a,b,c from table) -> process data at users PC (due to increase in computational power of current PC) (select a, join b, where c…)

    eg:
    select
    a.a,
    b.x,
    etc..

    from
    (select

    ..
    ) a
    left outer join
    (select

    ) b
    on b.a = a.a
    etc..

    (in total I have like close to 11 subqueries joining a min of 4 tables at each instance)

    Ps : reading Eric/k statement, perhaps PostGres could better handle things?

    As it is, right now, I’m exploring the option of dumping the data into users’s PC into SQLite and doing the processing there so that users PC will be un-disposed and not the server.

    My current issue is how to get it out of the MSSQL server and into the SQLite DB. Anyway… not the correct forum/blog

    Comment :: July 6, 2007 @ 6:47 am

  55. 55. Alain

    Hi, I’m having a similar challenge with a table that is storing files for a web application. The table structure is as following and it stores uploaded files in many portions (fragments) of max. 4MB each:
    CREATE TABLE `files` (
    `fragmentid` int(10) unsigned NOT NULL default ‘0′,
    `data` mediumblob NOT NULL,
    `pid` int(10) unsigned NOT NULL default ‘0′
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1;

    The “fragmentid” species a part of the file (data), and the pid specifies the id of the relating record in another table with some meta data. The ‘data’ attribute contents the binary fragments. Building the entire file at the end consists of just putting all the fragments in the right order together. I also have to add, that once a file has been retrieved on request, the content is cached on the file system, in a manner that calling that file afterwards do not require a query, unless the cached file is deleted.

    Meanwhile the table has a size of 1,5Gb. The queries that were taking less than 1 second some times ago are taking at least 20 to 30 seconds. I’ve read the differents comments from this and other forums. It seems like it will be more efficient, to split the tables i.e. in a manner that the table size remain in a good range for fast queries.

    Can anybody here advice me, how to proceed, maybe someone, who already have experienced this. Tnx.

    Comment :: September 10, 2007 @ 2:34 am

  56. 56. Ray

    I am having a problem with updating records in a table. We have aprox 14,000,000 records using over 16gigs storage. This table stores (among other things) ID, Cat, Description, LastModified. The engine is InnoDB.

    I am having a problem when I try to “prune” old data. This query takes about 45 minutes to execute (DELETE FROM Data WHERE Cat=’1021′ AND LastModified

    Comment :: October 16, 2007 @ 10:47 am

  57. 57. Ray

    I am having a problem with updating records in a table. We have aprox 14,000,000 records using over 16gigs storage. This table stores (among other things) ID, Cat, Description, LastModified. The engine is InnoDB.

    I am having a problem when I try to “prune” old data. This query takes about 45 minutes to execute (DELETE FROM Data WHERE Cat=’1021′ AND LastModified < ‘2007-08-31 15:48:00′)

    Each Cat may or may not be updated. I am trying to prune a Cat after updating the Cat to clear out any records that were not updated (hence deleted from our files). There are aprox 900 different Cat values (all integers).

    We’ve tried to put a combined index on Cat and LastModified. This did not seem to speed it up any.

    I attempted to use Sphinx to pull the ID’s of all records in a Cat, then lookup the LastModified from MySQL. This did not seem to help anything. (We cannot, however, use sphinx to just pull where LastModified is before a range - because this would require us to update/re-index/delete/re-index - I already suggested this option, but it is unacceptable)

    We are at a loss here. Any help would be appreciated!

    Comment :: October 16, 2007 @ 10:52 am

  58. 58. Jim Burke

    Indexes end up becoming a liability when updating a table.. The indexes need to be rewritten after each update slowing the process down. Indexes help speed up the look up phase of a record(s). Try minimizing the number of indexes while running a big process like this one, then reapplying the indexes afterward. How long does it take to get a SELECT COUNT(*) using the conditions used in your DELETE statement? My guess is not 45 minutes.

    Comment :: October 16, 2007 @ 11:00 am

  59. 59. hari

    This is hari, Here is my work…….there will be so many so many machines(lathe,crushing),which they are operated at different places with PLC boards.
    Some operators will control the machines by varying the values in the plc board.We need to collect that values from those machines via wireless communication and store that values into the database server.We need to observe that ,the operator operating the machines correctly or not at server place.Here problem is how we have to create the database for dynamic data. That is operator can change his entire table of data (values) at any point of time. With that variation also database need to be updated with these values(that is old table of values need to be replaced with new values when ever change occurs at the machine) please observe that, we need to replace entire table of values with some other values not a single row .Please give me the structure to represent this database(dynamic data).

    Comment :: October 24, 2007 @ 5:22 am

  60. 60. Chandan

    Good article.

    Comment :: November 17, 2007 @ 1:08 am

  61. Link Attack I

    OSWD: Two Designs

    Open Source Web Design
    If your mainly a programmer, but would love to turn out sites with a bit more fit and finish, this site is for you. They provide packaged “site designs” which include sample HTML, css and images in a ready…

    Trackback :: December 16, 2007 @ 12:41 am

  62. 62. Hugo Rafael

    Hi!..
    I’m currently working on a web project using MySql, Apache and Php.
    The database has a relatively acceptable size, not only in number of tables, but also in some table sizes.
    The problem I have, is regarding some specific tables in the database, which I use for a couple of months duration, minning them with detailed data of a particular task. After those months pass, I’ll drop all those tables, and rebuild them once again for another couple of months work.

    I’m currently at design time, and at half the specified month period, and so far, the tables are about 10GB in size each (so far, 6 in total).
    SELECTing data from the tables is not a problem, and it’s quite fast (<1 sec.), which is what it’s intended, but INSERTing in them is a nightmare.

    I use multiple record INSERTs (150 each time), and the first INSERT usually takes 600 secs (when the task starts). After this one is INSERTed I can easily INSERT 100.000 (in groups of 150) in less than 10 minutes.

    The question I have, is why is this happening, and if there is any kind of query I can preform in order to “focus” the DBMS “attention” to the particular table (in context), since SELECTing data is always faster then INSERTing it.

    The table spec is the follows:
    CREATE TABLE IF NOT EXISTS TableName (
    A INT(10) UNSIGNED NOT NULL,
    B INT(10) UNSIGNED NOT NULL,
    C TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    D TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘1′,
    E TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    a TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    b TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    c TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    d TINYINT(3) NOT NULL DEFAULT ‘0′,
    e TINYINT(3) UNSIGNED NOT NULL DEFAULT ‘0′,
    PRIMARY KEY (A,B,C,D),
    KEY (E),
    CONSTRAINT key_A FOREIGN KEY (A) REFERENCES ATable(A) ON DELETE NO ACTION ON UPDATE NO ACTION,
    CONSTRAINT key_B FOREIGN KEY (B) REFERENCES BTable(B) ON DELETE NO ACTION ON UPDATE NO ACTION
    ) ENGINE=InnoDB;

    The capital letters are INDEXes and some reference columns in other tables, and the small letters are just data columns that are returned but never used as filters (WHERE/JOIN/ORDER CLAUSEs).
    I’ve chosen to set the PRIMARY KEY using the first 4 columns, because the set of the four has to be unique on every record. I’ve used a different KEY for the last capital letter column, because I use that column to filter in the table.
    I use a group of these tables in the system, and preform simple SELECTs in them (not joining them with other tables).
    Once again, I use multiple inserts into the table, with sets of 150 records each:
    INSERT INTO T (col1,col2) VALUES (1,1),(1,2),(1,3)…(1,150);

    Is this the best design for the table, taking in consideration what I want from it?!
    Could the INSERTs be slow, dued to the size of the PRIMARY KEY?! Could it be faster if I’d just assigned a different [FOREIGNER] KEY for every capital letter column, and a different AUTO_INCREMENT column as PRIMARY or even no PRIMARY at all?!

    Cheers for your time in advance.

    Comment :: January 31, 2008 @ 2:21 pm

  63. 63. Jim

    In general without looking at your ini file and having no knowledge of your system it would be your indexes. The more indexes you have the faster SELECT statments are, but the slower INSERTS and DELETES. This is because the indexes have to be rewritten everytime you make a change. You can tweak memory usage in your ini file or add memory or processors to your computer. the type of DB you are using for the job can be a huge contributing factor for example Innodb vs MyISAM. You need to weigh the pros and cons of all these factors against your situation and then make your own educated decision.

    Comment :: February 1, 2008 @ 9:33 am

  64. 64. Hugo Rafael

    In your opinion, is better policy to have a separate index for each column I search, or a combined index for all the columns?

    Cheers.

    Comment :: February 1, 2008 @ 4:54 pm

  65. 65. Jim Burke

    It depends on your searches. MySQL will not combine indexes, only choose the best one it sees fit. There for if you do a search on 2 columns a lot then I would have a separate index to speed up that query. Indexes are great and the more the better, but if you want to bulk upload records then you will suffer the consequences. Sometimes if I have many indexes and need to do bulk inserts or deletes then I will kill the indexes, run my process and then recreate my indexes afterward. Hope this helps.

    Comment :: February 2, 2008 @ 5:48 pm

  66. 66. Hugo Rafael

    I didn’t said I wanted to combine indexes, I was talking about a combined index.
    Is it better to have:
    INDEX (’col1′,’col2′,’col3′);
    or:
    INDEX1 (’col1′), INDEX2 (’col2′), INDEX3 (’col3′)
    ?!

    The reason I’m asking is that I’ll be inserting loads of data at the same time, and the insert has to be relatively quick. On the other hand, I’ll need to select the data even quicker. The table has hundreds of millions of records.

    Cheers.

    Comment :: February 2, 2008 @ 7:01 pm

  67. 67. Jim

    Sorry for the confusion, but this is what I was talking about also. If you need to search on col1, col2, col3 then create an index(col1,col2,col3). What I was saying is if you have an index(col1), index(col2), and index(col3) and you run a query and want it optimized, the MySQL engine will not combine the second example to give you index(col1,col2,col3) for a more optimized query. It will only pick index(col1) or index(col2), index(col3) or none of the above. It doesn’t take any longer to create any one of these indexes, but I’m sure you know all this. The trick I use, because I have one database that is fairly large with many indexes, is to drop the indexes, then do the bulk up load, and then recreate the indexes.
    I hope this helps.

    Comment :: February 4, 2008 @ 6:14 am

  68. 68. Hugo Rafael

    It does help, cheers.
    Dropping the index is out of the question, since dropping them and creating them takes far too much time, being even quicker to just let them be.
    I think the answer to this, is just drop the PK and FK’s, and create a normal index with the two main searchable columns. Although this index seams to be a bit slower, I think it might be quicker on large inserts on the table.
    I suppose that I’ll have to break the table up, as well, in order to have all the data in smaller tables and smaller indexes.

    I was having indexes almost the size of the complete table (+/- 5GB), which made the whole table around 10GB. I have been playing with different indexes and at this time I managed to drop the index’s size to up 1.5GB, which is much more acceptable. Although the selects now take 25% more time to perform, it’s still around 1 second, so it seams quite acceptable to me, since there are more than 100 million records in the table, and if it means that the inserts are faster.

    Cheers for your help.

    Comment :: February 4, 2008 @ 8:23 am

  69. 69. xeeshaan

    Best Practice to deal with large DBs is to use a Partitioning Scheme on your DB after doing a thorough analysis of your Queries and your application requirements. Since 5.1 support Data Partitioning, I am using the scheme over a Huge DB of Call Details records which is growing as 20M (approximately 2.5GB in size) records per day and I have found it an appropriate solution to my Large DB issues.

    As for Joins, its always best practice not to use joins over Large Tables. Instead use alternate Sub-queries to Joins where possible and with the use of Partitioning make subsets of your data and bring them in Heap-tables rather than storing them on Disk and perform your operations making sub-tasks of Task.

    Remember when Anaconda eats a deer it always take time to get it right in itss stomach. So give your Anaconda small pieces of meat than full deer all in once.

    Comment :: May 1, 2008 @ 1:45 am

  70. 70. Hugo Rafael

    That seems to be the solution. I had already found that solution on MySql web site, although 5.1 is still not stable according to them, and 5.0 doesn’t support partitioning.
    I’ll have to do it like that, and even partitioning over more than one disk in order to distribute disk usage.

    Cheers.

    Comment :: May 1, 2008 @ 10:06 am

  71. Hi guys,
    I have recently written an article, “Full Text Partitioning - The Ultimate Scalability”, and my article can be found at the url below:

    http://www.addedworth.com/index.php/2008/06/03/full-text-partitioning-the-ultimate-scal

    This article describes the steps to take when a database is spilled over to more than a single server. Do come by my site and let me know your opinion.

    Thanks,
    YC Wee

    Comment :: June 3, 2008 @ 1:52 am

  72. Normally MySQL is rather fast loading data in MyISAM table, but there is exception, which is when it can’t rebuild indexes by sort but builds them
    row by row instead. It can be happening due to wrong configuration (ie too small myisam_max_sort_file_size or myisam_max_extra_sort_file_size) or
    it could be just lack of optimization, if you’re having large (does not fit in memory) PRIMARY or UNIQUE indexes.

    Comment :: June 17, 2008 @ 5:51 am

 

Subscribe without commenting


This page was found by: mysql slow google uses mysql how to optimize join... mysql + big tables insert slow on large...