August 2, 2014

Watch out for your CRON jobs

Resolving extreme database overload for the customer recently I have found about 80 copies of same cron job running hammering the database. This number is rather extreme typically the affect is noticed and fixed well before that but the problem with run away cron jobs is way to frequent.

If slow down happens on the database server or job takes longer to run it often can’t complete before its time for it to run again and unless prevented the second copy will run, which will have to compete with first copy for resources so having even less chance to finish. I leave the question of what effect on results running multiple cron jobs at the time may have.

Here are few practices which should help you to keep your cron jobs under control.

Prevent running multiple copies This is the most important one. I would suggest you having “production requirement” of no cron jobs allowed unless they prevent themselves from started in multiple copies or put a wrapper script around developer written jobs when you put them in production.

This can be very well done using file locks (do not just create files, files left after script crash can prevent it from starting again) or using GET_LOCK function in MySQL. The second one is good if you want to serialize jobs from multiple servers (for example you specially put script in CRON job on 2 servers for High Availability Purposes). It is also helpful if you want to limit concurrency for certain processes – like if you have 50 web servers which run certain cron jobs, but you do not want more than 4 of them to run at once.

Watch for errors Cron has a powerful feature of mailing you the output. If you make script to be silent and only print error messages (best to stderr) you can catch when problems start to happen – for example if job failed to run because other one is running which is not expected in your system. In large systems you may approach this problem differently to avoid hundreds of cron error messages when you restart database server etc but information about cron errors should find you.

Store Historical Run Times In a lot of cases when CRON job can’t complete in time any more I’m wondering of it happened over night or if it was taking more and more time gradually until it could not complete in time. Create the table in the database and store information about how long cron took to run. It can also be done by wrapper script but best if it is done internally as you can store some other metrics as well. For example you can store something like script took 40 seconds and processed 4000 images. In this case you would see if slow down happens because the amount of “work” increases or because system gets slower in processing this work. It is good if you can hook up monitoring to this trending data as well. For example you may get an alert if CRON job which runs every 5 minutes and normally completes in 30 seconds took more than 2 minutes to run.

Finally I would like to share the script I used for exclusion of running CRON jobs in the simple cases when I did not want to run any extra code to do it in the script itself.

About Peter Zaitsev

Peter managed the High Performance Group within MySQL until 2006, when he founded Percona. Peter has a Master's Degree in Computer Science and is an expert in database kernels, computer hardware, and application scaling.

Comments

  1. Louis says:

    If anyone wonders, this can be achieved very easily in Perl with the module Proc::PID::File:

    use Proc::PID::File;
    exit if Proc::PID::File->running();

    Documentation can be found at http://search.cpan.org/dist/Proc-PID-File/File.pm

  2. pplu says:

    Hi,

    You can always take a look at http://www.fatalmind.com/software/hatools/. It comes to be your locking script on steroids :D

  3. Patrick says:

    We’ve been using lockrun (http://unixwiz.net/tools/lockrun.html) to achieve the same thing, I’ll have to check out the other options.

  4. I would caution against using file-based locking. In my experience locking/exclusion code is likely to be integrated into the codebase, and then it becomes ugly to fix when cron jobs are moved onto different servers. I have seen this happen twice in my short career. Coordinating work across more than one server is a pretty basic requirement in most environments. Using a centralized lock manager from the start is a much better idea, unless you are Google. GET_LOCK() tends to work pretty nicely in my experience.

  5. peter says:

    Baron,

    I think a lot depends on what the script is doing. If the script is run in only one instance globally file locking can work quite fine – just keep lock in /tmp/lockname or something similar.

    There are benefits and drawbacks of file locks for sure. For example someone or something may remove the file causing other script to get the same lock. It is much harder to “steal” the lock which is taken by GET_LOCK. On other hand if MySQL server goes away or connection is closed the GET_LOCK lock is gone while lock on local file system should not be lost due to any network connection issues etc. Of course if main thing script does is database related GET_LOCK is great because if you lose connection you have aborted the script anyway.

    Whatever locking method you chose it is the most important you have one and it works stable for the task.

  6. I was never fond of file-based locking because from my experience it behaves differently between different os’s and file systems. It’s a quick and dirty recipe for a headache if misused, in contrast to what can be achieved with mysql’s named locks (GET_LOCK).

    We had scenario some years back where cron job would periodically fire up 50 instances of php processes (which used mysql as a job queue and a job manager) to do datamining. Prevention of running multiple copies was handled via job_is_running, retry_count etc switches in jobs table and there were surprisingly low count of jobs which got stuck. Those who would get stuck, we would clenup on timeout and eventually rerun.

    Now, I would do this with http://gearman.org/ or something similar.

    At the end it all breaks down on your specific problem which you need to solve and tools you have at your disposal.

  7. joy says:

    Hi Peter,

    Do you think possible to use semaphore http://us.php.net/manual/en/function.sem-acquire.php rather than flock() ?

  8. imran says:

    Most crons that i personally write are in either python or bash. In python, i wrote myself up a lib that basically acts as a ps grep of sorts and extracts the date of execution, status of the process and the process name. i’ve used the lib widely when determining whether my cron process is already running or not (or for that matter suspended/crashed so I can send a notice out). The same lib has proved very useful for writing start/stop frontends as well.

  9. peter says:

    Neven,

    Right. There are possible issues. Though a lot of systems out there even at the large scale would run only one operation system or may be the couple.

    The GearMan is indeed great solution but I think it is overkill for a lot of simple tasks.

  10. peter says:

    Joy,

    Yes. I see no problem with using semaphores if you prefer. I do not like it as much as it is harder to see the status of semaphores using command line tools.

  11. I’m surprised you didn’t recommend using MySQL events — many people are on 5.0 or higher and can use them. Errors are put into the error log (which I recommend monitoring, at Pythian we monitor error logs, and we can specify ‘noise’ to ignore like if there’s a lot of “aborted connects”, so we see the important things), so that issues is resolved.

    In fact, the times that the event started and finished are also logged in the error log, so there is a historical log of how long things take.

  12. peter says:

    Sheeri,

    I am mainly speaking about shell/perl/php/python/ruby scripts a lot of systems are running. If you just need simple query/stored procedure run you can use MySQL events of course.

  13. Fourat says:

    A more secure way to lock/unlock your script is to use register_shutdown_function() at the begin to register the action to be done on the end.

  14. Peter,

    Many people have shell/perl/php/python/ruby scripts that are cron jobs that only do database reads/writes, and can be converted to MySQL Events.

    Obviously if there are actions outside MySQL, it’s not useful as a MySQL Event.

    Pythian actually mitigates this issue by monitoring long-running OS processes.

  15. XCondE says:

    man 1 flock
    flock [-sxon] [-w timeout] lockfile [-c] command…

    */5 * * * * flock -xn /var/lock/mylockfile -c my-shell-script.sh

    Only one will ever run. You can also tell it to wait instead of bailing out:

    */5 * * * * flock -x /var/lock/mylockfile -c my-shell-script.sh

    Or wait only 30 secs:

    */5 * * * * flock -x -w 30 /var/lock/mylockfile -c my-shell-script.sh

Speak Your Mind

*