
Prevent Overlapping Cron Jobs with flock


Quick Answer

A cron job that overlaps itself is one of the quietest ways to take a server down. Schedule a sync every minute, hit one slow-disk day, and you suddenly have eight copies fighting over the same files. flock fixes this with a kernel-held lock on a file descriptor: open a lock file on a dedicated fd, call flock -n on it, and bail if another run already holds it. Because the lock lives on the open descriptor — not on the file existing — the kernel releases it automatically when the process exits, even on a crash or kill -9, so you never get the stale locks that plague PID-file checks. Use -n to skip this run when a previous one is still going, or -w to wait a set number of seconds. Keep the lock file in /run/lock, never /tmp, where temp-cleaners delete it mid-run.

A */1 rsync took our staging box to a load average of 41 one afternoon, and it took me longer than I want to admit to understand why. The sync normally finished in about twenty seconds. That day the backup target's NFS mount went sluggish, the sync started taking ninety seconds, and cron — which does not know or care whether the last run is still going — launched a fresh copy every single minute on top of it. Inside ten minutes there were a half-dozen rsyncs all reading the same tree off the same slow disk, each one making the disk slower, each new minute adding another. The box wasn't under attack. It was attacking itself, one polite copy at a time.

The fix is to make the job refuse to run while a copy of itself is already running. People reach for a PID file first — write $$ to /var/run/job.pid, check if it exists on the next run — and it almost works, until the day a run gets kill -9'd or the box reboots mid-job and leaves a stale PID file behind. Now every future run sees a "lock" pointing at a process that died on Tuesday, and the job never runs again. There's also a quiet race between checking the file and writing it. PID files are a lock you have to remember to clean up, and the times you most need the lock are exactly the times cleanup didn't happen.

flock has none of that. The lock isn't a file you create and delete. It's a lock the kernel holds on an open file descriptor, and the kernel releases it automatically once that descriptor is closed. The process exiting closes the descriptor. So does crashing. So does kill -9. There is no state to leave behind. The one caveat: child processes inherit the descriptor, so a child that outlives the script (a backgrounded daemon, say) keeps the lock held until it exits too.
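One way to check this behaviour for yourself is the sketch below. It holds a lock in a background process, kills that process with kill -9, and probes the lock before and after. The mktemp lock file and the variable names are assumptions made so the demo runs without root; a real job would keep its lock under /run/lock.

```bash
# Demo only: throwaway lock file so the sketch runs as any user.
LOCK_FILE="$(mktemp)"

# Holder: open the lock file on fd 200, take the lock, then become `sleep`
# via exec so that $! is the PID of the process that owns the descriptor.
(
    exec 200>"$LOCK_FILE"
    flock 200
    exec sleep 30
) &
holder=$!

# Poll (up to ~5 s) until the holder really has the lock.
for _ in $(seq 1 50); do
    flock -n "$LOCK_FILE" true || break
    sleep 0.1
done

# Probe the lock while the holder is alive.
if flock -n "$LOCK_FILE" true; then held_before=no; else held_before=yes; fi

# Kill the holder the hard way: no trap or cleanup code gets a chance to run.
kill -9 "$holder"
wait "$holder" 2>/dev/null

# Probe again: the kernel dropped the lock when the descriptor closed.
if flock -n "$LOCK_FILE" true; then held_after=no; else held_after=yes; fi

echo "held before kill -9: $held_before"
echo "held after kill -9:  $held_after"
rm -f "$LOCK_FILE"
```

The `exec sleep 30` matters here: without it the subshell would fork sleep as a child, the child would inherit fd 200, and killing only the subshell would leave the lock held by the surviving sleep.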

The single-instance pattern

```bash
#!/bin/bash
# Script:  backup-with-lock.sh
# Purpose: Stop a cron job from overlapping itself when one run runs long
# Usage:   backup-with-lock.sh  (self-locking, safe to call from any */N cron)

set -euo pipefail

CHECK="✓"
CROSS="✗"

# /run/lock is tmpfs, cleared cleanly on reboot. Never /tmp: temp-cleaners
# delete files there, and a deleted lock mid-run lets a second copy run.
LOCK_FILE="/run/lock/$(basename "$0").lock"

# The > opens (and creates) the lock file on fd 200 and holds it open for the
# whole script. The LOCK lives on this open descriptor, not on the file
# existing, so never rm the file to "release" it.
exec 200>"$LOCK_FILE"

# -n = non-blocking: if a previous run still holds the lock, give up now
# instead of queueing another copy behind it.
if ! flock -n 200; then
    echo "$CROSS $(date '+%F %T') previous run still active, skipping" >&2
    exit 0
fi

echo "$CHECK $(date '+%F %T') lock acquired, starting"

# --- the real work ---
rsync -a --delete /data/ /mnt/backup/data/

echo "$CHECK $(date '+%F %T') finished, kernel releases the lock on exit"
```

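The two lines doing the work are exec 200>"$LOCK_FILE" and flock -n 200. The first opens the lock file on a descriptor that stays open for the life of the process. The second tries to grab the lock without waiting; if another run already holds it, flock returns non-zero, and we log the skip to stderr and exit 0. A skipped run is normal, not an error, so it shouldn't register as a failed job. One thing to know: standard cron mails any output a job produces, whatever the exit status, so redirect stderr to a log in the crontab entry (or drop the skip message) if you don't want a mail for every skipped minute.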

Notice there is no cleanup. No trap to remove a PID file, no rm at the end. When this script exits for any reason, fd 200 closes and the lock is gone. That "for any reason" is the whole point: the failure modes that strand a PID file are the ones flock shrugs off.
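If only part of a script needs the lock, the same idea works scoped to a subshell, following the pattern shown in the flock man page. The sketch below is one possible shape, not the article's script: run_exclusive is a hypothetical helper name, fd 9 is an arbitrary choice, and the mktemp lock file is for the demo only.

```bash
# Hypothetical helper: run a command under a non-blocking lock.
# Usage: run_exclusive LOCK_FILE COMMAND [ARGS...]
run_exclusive() {
    local lock_file=$1
    shift
    (
        # fd 9 is open only for the lifetime of this subshell.
        flock -n 9 || { echo "lock busy, skipping: $*" >&2; exit 0; }
        "$@"
    ) 9>"$lock_file"
}

# Demo only: throwaway lock file; a real job would use /run/lock.
DEMO_LOCK="$(mktemp)"

# Nested call: the outer run holds the lock, so the inner run is skipped.
run_exclusive "$DEMO_LOCK" run_exclusive "$DEMO_LOCK" echo "inner run"

# Once the subshell has exited, fd 9 is closed and the lock is free again.
run_exclusive "$DEMO_LOCK" echo "second run"

rm -f "$DEMO_LOCK"
```

The nested call prints only the skip message on stderr, and the later call prints "second run". As with the exec 200> form, there is nothing to clean up: leaving the subshell closes the descriptor.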

If you don't want to edit the script

You can lock a command without touching it at all, straight from the crontab line:

```bash
# Skip the run if the last one is still going
*/1 * * * * /usr/bin/flock -n /run/lock/sync.lock /usr/local/bin/sync.sh
```

flock runs sync.sh only if it can grab /run/lock/sync.lock; if a previous minute's run is still holding it, this minute's run exits immediately and does nothing. This is the fastest retrofit for a job that's already misbehaving — you don't even have to redeploy the script.

-n skips; -w 30 waits up to 30 seconds then gives up. Pick -n for frequent jobs where a skipped run is harmless (a metrics push, a sync that catches up next minute), and -w for jobs that must eventually run but can tolerate a short queue. Never use a bare flock with no -n and no -w on a fast cron: it blocks until the lock is free, however long that takes, so the runs you meant to skip quietly pile up as waiting processes.
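The difference between the two flags is easy to see with a short experiment. In this sketch a background process holds the lock for about three seconds while three attempts are made against it. The timings and the mktemp lock file are assumptions chosen to keep the demo quick.

```bash
# Demo only: throwaway lock file; a real job would use /run/lock.
LOCK_FILE="$(mktemp)"

# Holder: keeps the lock for roughly 3 seconds.
flock "$LOCK_FILE" sleep 3 &
sleep 1   # give the holder time to take the lock

# -n: fails immediately because the lock is busy.
flock -n "$LOCK_FILE" true
rc_n=$?

# -w 1: waits one second, the holder is still running, so it gives up.
flock -w 1 "$LOCK_FILE" true
rc_w_short=$?

# -w 10: waits until the holder finishes, then gets the lock.
flock -w 10 "$LOCK_FILE" true
rc_w_long=$?

wait
echo "-n: $rc_n   -w 1: $rc_w_short   -w 10: $rc_w_long"
rm -f "$LOCK_FILE"
```

Both failed attempts return 1, flock's default status when the lock cannot be acquired, and the patient attempt returns 0.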

The load-41 afternoon ended the moment I wrapped that rsync in flock -n. The slow NFS mount was still slow, but now exactly one sync ran at a time and the extras skipped harmlessly until the disk recovered. The interesting part is that locking didn't fix the slow disk — it stopped a transient slow disk from turning into a self-inflicted outage. That's the difference between a script that works when you run it and a script that survives unattended.

A lock alone isn't enough, though. If the locked job itself hangs — a sync blocked on a dead socket that never returns — it holds the lock forever and every future run skips, so the job silently stops running and you find out days later. That's why locking pairs with bounding a command's runtime with timeout, and with retrying transient failures so a single network blip doesn't kill the run. The Hardened Cron Wrapper Generator stitches all three together for you, and the full reasoning for which jobs need which guard is in Bash Scripts That Survive Cron.
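A minimal sketch of the lock plus timeout pairing, using the coreutils timeout command. The one-second limit, the sleep standing in for a hung job, and the mktemp lock file are all illustrative assumptions; a real limit should sit comfortably above the job's normal runtime.

```bash
# Demo only: throwaway lock file; a real job would use /run/lock.
LOCK_FILE="$(mktemp)"

# `sleep 30` stands in for a job that hangs. timeout ends it after 1 second,
# and flock passes the command's exit status through.
flock -n "$LOCK_FILE" timeout 1 sleep 30
rc=$?
echo "exit status: $rc"   # timeout reports 124 when it had to end the command

# The hung job is gone, so the lock is free for the next run.
if flock -n "$LOCK_FILE" true; then lock_free=yes; else lock_free=no; fi
echo "lock free afterwards: $lock_free"

rm -f "$LOCK_FILE"

# The same pairing on a crontab line (300 seconds is an assumed limit):
# * * * * * /usr/bin/flock -n /run/lock/sync.lock /usr/bin/timeout 300 /usr/local/bin/sync.sh
```

With the bound in place, a hung run costs at most one timeout period of skipped runs instead of stalling the job indefinitely.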


Need to test this on a real box without risking a production server? Spin up a throwaway droplet, schedule the job at */1, and watch flock skip the overlaps. The rest of the unattended-automation toolkit is at bashsnippets.xyz — the Cron Job Builder for the schedule itself, and Bash error handling for the layer underneath all of this.


Written by Anguishe

Creator of BashSnippets.xyz

bashsnippets.xyz/about


Frequently Asked Questions


Does flock create the lock file for me?

In the descriptor form, the shell redirection creates it, not flock. When you run `exec 200>/run/lock/job.lock`, the `>` opens (and creates) the file on file descriptor 200. flock then places an advisory lock on that open descriptor. In the command form, `flock -n /run/lock/job.lock some-command`, flock opens the file itself and creates it if it is missing. Either way the file is just an anchor: its mere existence is not the lock, which is why you should never `rm` it to 'release' the lock. The lock is the kernel's, attached to the descriptor, and it disappears once the descriptor closes.


Why does flock -n exit successfully when the lock is already held?

It does not. `flock -n` itself returns exit code 1 when it cannot acquire the lock. What you choose to do with that failure is up to you. The common pattern for a frequent cron job is `flock -n 200 || exit 0`: treat 'someone else is already running' as a normal, non-error outcome and skip quietly. If you'd rather the skipped run register as a failure, `exit 1` instead, but then any monitoring or wrapper that alerts on non-zero exits will fire on every busy minute. Standard cron decides whether to mail based on output rather than exit status, so a skip that prints nothing sends no mail either way.
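In the command form, an exit status of 1 is ambiguous: it could mean the lock was busy, or that the job ran and failed with 1. The util-linux version of flock has an -E (--conflict-exit-code) option for telling these apart. The sketch below assumes that option is available on your system; the value 75 and the mktemp lock file are arbitrary choices for the demo.

```bash
# Demo only: throwaway lock file; a real job would use /run/lock.
LOCK_FILE="$(mktemp)"

# Holder: keeps the lock for roughly 3 seconds.
flock "$LOCK_FILE" sleep 3 &
sleep 1   # give the holder time to take the lock

# Lock busy: flock never runs `false` and exits with our chosen code.
flock -n -E 75 "$LOCK_FILE" false
rc_busy=$?

wait      # holder finishes, lock is free

# Lock free: `false` runs and its own exit status (1) comes back.
flock -n -E 75 "$LOCK_FILE" false
rc_job=$?

echo "busy: $rc_busy   job failed: $rc_job"
rm -f "$LOCK_FILE"
```

A wrapper can then treat the conflict code as a harmless skip and anything else non-zero as a real failure.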


flock or a PID file — which should I use?

flock, in almost every case. A PID file has a race between the check and the write, and it leaves a stale file behind when the process is killed or the box reboots mid-run, so your next run sees a 'lock' owned by a PID that no longer exists. flock has neither problem: the lock is atomic and the kernel drops it automatically when the holding process dies, however it dies. PID files only make sense when you need the PID itself for something else.


Where should the lock file live — /tmp, /var/lock, or /run/lock?

/run/lock on any modern systemd system. It's a tmpfs that's cleared cleanly on reboot, which is exactly what you want for a lock. Avoid /tmp: systemd-tmpfiles and tmpreaper periodically delete old files there, and if your lock file is deleted while a long run holds it, a second run opens a brand-new file, locks that instead, and runs concurrently — the precise failure you were trying to prevent.
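The deleted-lock-file failure can be reproduced in a few lines. In this sketch an rm stands in for the temp-cleaner, and the mktemp path and variable names are assumptions for the demo.

```bash
# Demo only: a lock file in the temp directory, the very thing to avoid.
LOCK_FILE="$(mktemp)"

# First run: takes the lock on fd 200 and keeps working (sleep stands in).
(
    exec 200>"$LOCK_FILE"
    flock 200
    exec sleep 30
) &
first=$!

# Poll (up to ~5 s) until the first run really has the lock.
for _ in $(seq 1 50); do
    flock -n "$LOCK_FILE" true || break
    sleep 0.1
done

# While the file is intact, a second run is refused.
flock -n "$LOCK_FILE" true
rc_intact=$?

# Stand-in for the temp-cleaner: delete the file while the lock is held.
rm -f "$LOCK_FILE"

# A second run now creates a new file at the same path and locks that,
# even though the first run is still going.
flock -n "$LOCK_FILE" true
rc_deleted=$?

echo "file intact:  second run exit $rc_intact"
echo "file deleted: second run exit $rc_deleted"

kill "$first"
wait "$first" 2>/dev/null
rm -f "$LOCK_FILE"
```

The first probe fails with 1 and the second succeeds with 0: the first run still holds its lock, but on a file that no longer has a name, so nothing at the original path is protected.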


How do I make flock wait for the lock instead of skipping the run?

Use `-w` with a timeout in seconds: `flock -w 30 200 || exit 1` waits up to 30 seconds for the lock, then gives up. Use this when the job must eventually run and a short queue is acceptable. Use `-n` (non-blocking) when runs are frequent and a skipped run is harmless because the next one will catch up — a sync or a metrics push, for example.