A few days ago I was working on a case where we needed to modify a lot of data before pushing it to Sphinx. MySQL did not have a function to do the job, so I thought I'd write a MySQL stored function and we'd be good to go. It worked! But not so well, really: building the index, which used to take 10 minutes, was now taking 16. Then we added another MySQL function for a different set of attributes, and indexing time went from 16 minutes to 26. I knew using a UDF would be faster, but I had no idea how much. Have you ever wondered?
So what were the modifications we needed? A couple of very simple things: (1) two varchar columns needed leading non-alphabetic characters trimmed, so "123 ^&* and some text" would become "and some text", and (2) the same two varchar columns needed certain doubled characters collapsed to single ones, so "Picasso" becomes "Picaso", "Wesselmann" becomes "Weselman", and so on. Why we needed that is another story, which this blog post is not about. Note, however, that only a very small portion of the data actually needed to be modified.
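To make the two transformations concrete, here is a minimal C sketch of the logic. This is only an illustration, not the original code: the function names are mine, and the exact set of letters being collapsed is an assumption taken from the chain of REPLACE() calls shown further down.

```c
#include <ctype.h>
#include <string.h>

/* (1) Skip leading non-alphabetic characters:
   "123 ^&* and some text" -> "and some text" */
const char *ltrim_junk_c(const char *s)
{
    while (*s && !isalpha((unsigned char)*s))
        s++;
    return s;
}

/* (2) Collapse doubled letters for a fixed set of letters
   (set assumed from the REPLACE() chain): "Picasso" -> "Picaso" */
void remove_dups_c(const char *src, char *dst)
{
    const char *set = "abcdfglmnoprstvz";
    size_t j = 0;
    for (size_t i = 0; src[i]; i++) {
        int c = tolower((unsigned char)src[i]);
        if (j > 0 && tolower((unsigned char)dst[j - 1]) == c &&
            strchr(set, c) != NULL)
            continue;           /* drop the repeated letter */
        dst[j++] = src[i];
    }
    dst[j] = '\0';
}
```

Both are single passes over the string, which is exactly the kind of work where compiled code leaves interpreted stored-routine code far behind.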
Here are the two MySQL functions I wrote to do the job: ltrim_junk_mysql() and remove_dups_mysql(). Although processing a single row seemed instantaneous, we needed to process much more than that, and that was not nearly as fast. For example, here is how long it took to process 100k rows:
mysql> select ltrim_junk_mysql(author), ltrim_junk_mysql(title) from paintings limit 100000;
100000 rows in set (2.97 sec)

mysql> select remove_dups_mysql(author), remove_dups_mysql(title) from paintings limit 100000;
100000 rows in set (2.04 sec)
If you looked carefully at the second function, though, you may have noticed that I did not necessarily have to write a function at all; I could have expressed it as a plain SQL statement:
mysql> select
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        LOWER(author),
        'aa', 'a'), 'bb', 'b'), 'cc', 'c'), 'dd', 'd'), 'ff', 'f'), 'gg', 'g'),
        'll', 'l'), 'mm', 'm'), 'nn', 'n'), 'oo', 'o'), 'pp', 'p'), 'rr', 'r'),
        'ss', 's'), 'tt', 't'), 'vv', 'v'), 'zz', 'z'),
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
        LOWER(title),
        'aa', 'a'), 'bb', 'b'), 'cc', 'c'), 'dd', 'd'), 'ff', 'f'), 'gg', 'g'),
        'll', 'l'), 'mm', 'm'), 'nn', 'n'), 'oo', 'o'), 'pp', 'p'), 'rr', 'r'),
        'ss', 's'), 'tt', 't'), 'vv', 'v'), 'zz', 'z')
FROM paintings LIMIT 100000;
100000 rows in set (0.33 sec)
It doesn't look nice, but it already executes more than six times faster, which is interesting because it shows how much overhead the stored routine interface adds. So anyway, I asked my colleague Sasha to help me out by rewriting these as UDFs. Here are the resulting ltrim_junk() and remove_dups() functions. Well, guess what:
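For those curious what the UDF side looks like: a string UDF is a shared library exporting an init/row/deinit triple that the server calls around each statement. The sketch below shows the shape of such a function for the leading-junk trim. The UDF_INIT/UDF_ARGS stand-ins here are simplified stubs so the sketch compiles on its own; a real UDF includes <mysql.h> instead, and the field subset shown is an assumption for illustration, not Sasha's actual code.

```c
#include <ctype.h>
#include <string.h>

/* Simplified stand-ins for the structures <mysql.h> provides; a real
   UDF includes <mysql.h> rather than defining these. Field subset is
   an assumption for illustration only. */
typedef char my_bool;
typedef struct { unsigned long max_length; } UDF_INIT;
typedef struct {
    unsigned int   arg_count;
    char         **args;
    unsigned long *lengths;
} UDF_ARGS;

/* Called once per statement: validate the arguments. */
my_bool ltrim_junk_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
    (void)initid;
    if (args->arg_count != 1) {
        strcpy(message, "ltrim_junk() requires exactly one argument");
        return 1;               /* non-zero aborts the statement */
    }
    return 0;
}

/* Called once per row: return a pointer past the leading non-alpha
   bytes. Returning a pointer into the argument avoids any copying. */
char *ltrim_junk(UDF_INIT *initid, UDF_ARGS *args, char *result,
                 unsigned long *length, char *is_null, char *error)
{
    (void)initid; (void)result; (void)error;
    char *s = args->args[0];
    unsigned long i = 0, len = args->lengths[0];

    if (s == NULL) { *is_null = 1; return NULL; }
    while (i < len && !isalpha((unsigned char)s[i]))
        i++;
    *length = len - i;
    return s + i;
}

/* Called once per statement: free per-statement state (none here). */
void ltrim_junk_deinit(UDF_INIT *initid) { (void)initid; }
```

Once built as a shared object and dropped into the server's plugin directory, the function is registered with CREATE FUNCTION ltrim_junk RETURNS STRING SONAME '...'; the key point is that the per-row call is plain compiled code with no SQL interpreter in the loop.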
mysql> select ltrim_junk(author), ltrim_junk(title) from paintings limit 100000;
100000 rows in set (0.13 sec)

mysql> select remove_dups(author), remove_dups(title) from paintings limit 100000;
100000 rows in set (0.17 sec)
So for the ltrim_junk() function we got almost a 23x improvement, and for remove_dups() a 12x improvement over the stored function, or 2x over the chain of built-in REPLACE() calls. At that speed I could even scan the whole table of 7 million records:
mysql> select count(*) from paintings where title != ltrim_junk(title);
+----------+
| count(*) |
+----------+
|   101533 |
+----------+
1 row in set (6.82 sec)

mysql> select count(*) from paintings where author != ltrim_junk(author);
+----------+
| count(*) |
+----------+
|    28335 |
+----------+
1 row in set (6.63 sec)

mysql> select count(*) from paintings where author != remove_dups(author) OR title != remove_dups(title);
+----------+
| count(*) |
+----------+
|  2720414 |
+----------+
1 row in set (11.19 sec)
Whereas with the stored functions this used to take minutes!
I don't mean to say stored functions are bad and you should now rewrite all of them as UDFs. If you need to process just a few records per request, and you are not burning racks of CPUs doing the job constantly, the speed difference is negligible. However, in a case like this one, where we have to process many records constantly and every second counts, a UDF can really save your day. If you need one and don't feel confident writing C, you know who to call!
The P-Team!
Hi!
SQL-based functions are really only good for flexibility; in very few cases (and almost all of them involve putting business logic inside your database) will you get good results from using them.
Cheers,
-Brian
Hi Aurimas,
I have the same experience. UDFs are way faster. This is hardly surprising for functions that consume little memory: compiled gcc code is *way* faster than interpreted SQL code; there can be no competition here.
It does become questionable, though, when large amounts of memory are involved. I had a function which, like yours, needed to change some text; only the size of the text was up to 100KB.
The copying of memory from MySQL to the UDF and back was such an overhead that the UDF ran *much* slower than the stored routine, by orders of magnitude.
Shlomi, Brian –
I totally agree, and it's kind of obvious, but it was really interesting to see exactly how much slower stored functions are in a few very simple scenarios, so I could actually tell what difference it makes rather than just knowing it would be much slower. It is also interesting that in a case like the one Shlomi mentioned it could go the other way by so much!
Aurimas
Yeah, this can be quite a surprise for people who are used to much faster stored code execution in other databases such as SQL Server.
[…] A significant advantage of UDFs over stored functions, besides the opportunity to define global server functions, is that string functions in particular are far superior from a performance point of view, as was already proven by the guys at Percona. […]
Hi all,
I know this is a very old blog post, but is it possible to re-post it properly? All of the code chunks have been posted nested inside each other instead of sequentially, and I'm finding it impossible to read the narrative in between!
Cheers, Sam.
Sam –
thanks, it should look better now. We had an incident a couple of years ago after which we had to restore blog posts from external sources, and apparently not everything was recovered as nicely as we would have wished.
Aurimas