Few days ago I was working on a case where we needed to modify a lot of data before pushing it to sphinx – MySQL did not have a function to do the thing so I thought I’ll write MySQL Stored Function and we’ll be good to go. It worked! But not so well really – building the index, which was taking 10 minutes, was now taking 16 minutes. Then we added another MySQL function for different set of attributes and indexing speed went from 16 minutes to 26 minutes. I knew using UDF would be faster, but I had no idea how much. Have you ever wondered?

So what were the modifications we needed? It was couple very simple things – (1) two varchar columns needed leading nonalpha characters trimmed, so “123 ^&* and some text” would become “and some text”, and (2) same two varchar columns needed some double characters changed to single one so “Picasso” becomes “Picaso”, “Wesselmann” becomes “Weselman” and so on. Why we needed that is another story which this blog post is not about. Note however that only very small portion of data really needed to be modified.

Here are the two MySQL functions I wrote to do the job – ltrim_junk_mysql() and remove_dups_mysql(). Although processing single row seemed to be instantaneous, we needed to process much more than that – and that wasn’t as fast. For example, here’s how long it took to process 100k rows:

If you looked carefully at the second function though, you may have noticed I did not necessarily have to write a function, I could have written it as an SQL statement:

Doesn’t look nice, but it already executes more than 6 times faster which is interesting as it shows how much overhead you have by using mysql stored routines interface. So anyway, I asked my colleague Sasha to help me out by rewriting these as UDF functions. Here’s ltrim_junk() function and remove_dups(). Well, guess what:

So for ltrim_junk() function we got almost 23x improvement and for remove_dups – 12 times if comparing to stored function or 2 times comparing to just using available functions. With that speed I could even scan the whole table of 7 million records:

Whereas using stored function used to take minutes!

I don’t mean to say stored functions are bad and you should now rewrite all your functions as UDFs – if you need to process just a few records for a request and you are not burning racks of CPUs to constantly do the job, the speed difference is really negligible. However in case like this one where we have to process many records constantly and every second counts, UDF can really save your day. If you need one and don’t feel confident writing C, you know who to call!

8 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Murdoc

If you need one and don’t feel confident writing C, you know who to call!
The P-Team!

Hi!

SQL based functions are really only good for flexibility, in very few cases (and almost all of them deal with putting business logic inside of your databases) will you get good results from using them.

Cheers,
-Brian

Shlomi Noach

Hi Aurimas,
I have the same experience. UDF functions are way faster. This is hardly surprising for little-memory consuming functions: compiled gcc code is *way* faster than interpreted SQL code; there can be no competition here.
It does become questionable, though, when large amounts of memory are used. I had a function which, like yours, need to change some text. Only the size of the text was up to 100KB.
The copying of memory from MySQL to UDF and back was such an overhead, that the UDF ran *much* slower than the stored routine, by magnitudes.

Baron Schwartz

Yeah, this can be quite a surprise for people who are used to much faster stored code execution in other databases such as SQL Server.

MySQL My new playground « Web Design Library

[…] A significant advantage of UDF’s over stored functions is besides the opportunity to define global server functions, that especially string functions are from a performance point of view far superior – how it was already proven from the guys of Percona. […]

Sam Johnson

Hi all,

I know this is a very old blog, but is it possible to re-post it properly? All of the code chunks have been posted nested inside each other, instead of sequentially, and I’m finding it impossible to read the narrative in-between!

Cheers, Sam.