The backup ran fine every night for fourteen months, and then it didn't run for nine days, and nothing told me. There was no error in the log, no failed-job alert, no bounced cron mail. The nightly mysqldump had simply hung: the database had a long-held lock from a runaway analytics query, and the dump opened its transaction and sat there waiting for it, forever. Cron launched it at 2am, it never exited, and because I'd been smart enough to wrap it in a lock (so two dumps couldn't run at once), every subsequent night's run saw the lock still held by the hung process from the 9th and skipped quietly. The clever lock turned a one-night hang into a nine-day outage. I found it when I went to restore a table and discovered my newest "backup" was a mysqldump process that had been running since the previous Tuesday.
The lesson is blunt: under cron, a hung command is worse than a failed one. A failed command exits, frees its lock, and the next run tries again. A hung command exits never. It holds resources, blocks its own future runs, and produces exactly zero signal because it never gets far enough to log anything. You cannot rely on a command to bound its own runtime — the whole problem is that it's stuck somewhere it can't time itself out of.
So bound it from the outside.
Bounding the runtime
timeout -k "$KILL_GRACE" "$MAX_RUNTIME" mysqldump ... is the entire mechanism. With MAX_RUNTIME set to five minutes and KILL_GRACE to 20s, timeout sends the dump a SIGTERM at the five-minute mark. A well-behaved program treats SIGTERM as "wrap up and exit." But a stuck process can't always be counted on to do that: it may be trapping or ignoring the signal, or wedged somewhere its handler never gets to run. That's what -k 20s is for: twenty seconds after the polite signal, timeout sends SIGKILL, which a process cannot catch, block, or ignore. The one thing that can outlast SIGKILL is a process in uninterruptible sleep (state D, typically stuck on disk or NFS I/O), and even that one dies as soon as the kernel call returns.
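A minimal sketch of that invocation, assuming GNU coreutils timeout. The default values mirror the five minutes and 20s described above; run_bounded is a hypothetical helper name, and the mysqldump arguments in the comment are placeholders rather than the original command.

```bash
#!/usr/bin/env bash
# Sketch: bound any command's runtime from the outside.
MAX_RUNTIME="${MAX_RUNTIME:-5m}"   # SIGTERM is sent after this long
KILL_GRACE="${KILL_GRACE:-20s}"    # SIGKILL follows this long after SIGTERM

# run_bounded is a hypothetical helper name, not part of any library.
run_bounded() {
    timeout -k "$KILL_GRACE" "$MAX_RUNTIME" "$@"
}

# Example (placeholder arguments):
#   run_bounded mysqldump --single-transaction mydb > backup.sql.partial
```

Wrapping the call in a function keeps the two limits in one place, so every job in the script gets the same deadline unless it overrides the variables.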
Reading what happened
The exit code is the difference between a useful log and a mystery. With GNU timeout, 124 means the command was still running at the deadline and went away after the SIGTERM. 137 is 128 + 9, where 9 is SIGKILL's signal number: it had to be force-killed because the first signal didn't stop it. Any other non-zero code is the command's own failure, and you should treat it differently. Collapsing all three into "backup failed" throws away the one piece of information that tells you whether you have a slow database, a wedged one, or a broken dump command. (If you ever forget which code means what, the Bash Exit Code Lookup decodes 124 and 137 directly.)
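One way to keep the three cases apart in a log, sketched under the assumption of GNU timeout run without --preserve-status. describe_timeout_status is a hypothetical helper name, and the sleep stands in for the real dump.

```bash
# describe_timeout_status is a hypothetical helper; the codes assume GNU
# timeout run without --preserve-status.
describe_timeout_status() {
    case "$1" in
        0)   echo "ok" ;;
        124) echo "timed out: still running at the deadline, exited after SIGTERM" ;;
        137) echo "force-killed: needed SIGKILL (128 + 9)" ;;
        *)   echo "command failed on its own with exit code $1" ;;
    esac
}

# Capture the status immediately, before another command overwrites $?.
status=0
timeout -k 1s 1s sleep 5 || status=$?
echo "backup: $(describe_timeout_status "$status")" >&2
```

The `|| status=$?` form also keeps the script alive under `set -e`, which would otherwise abort on the first non-zero exit before anything was logged.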
Writing to a .partial file first matters more than it looks. If you redirect straight to the real backup file and the command times out mid-write, you've replaced last night's good backup with a half-written, unrestorable file, and you won't know until you need it. Writing to a temp path and mv-ing only on a clean exit means a failed or timed-out run leaves the previous good backup untouched.
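The write-then-rename pattern might look like the sketch below. write_backup_atomically is a hypothetical helper name, and deleting the partial file on failure is a choice made for this example; keeping it around for inspection would be just as valid.

```bash
# write_backup_atomically is a hypothetical helper name.
# Usage: write_backup_atomically DEST COMMAND [ARGS...]
write_backup_atomically() {
    local dest="$1"; shift
    local tmp="${dest}.partial"
    local status=0

    # The command's stdout goes to the temp path, never to the real file.
    timeout -k "${KILL_GRACE:-20s}" "${MAX_RUNTIME:-5m}" "$@" > "$tmp" || status=$?

    if [ "$status" -eq 0 ]; then
        # Same directory, so the rename replaces the old backup in one step.
        mv -f -- "$tmp" "$dest"
    else
        # Discard the half-written file; the previous backup is untouched.
        rm -f -- "$tmp"
    fi
    return "$status"
}
```

Keeping the temp file in the same directory as the destination matters: a rename within one filesystem swaps the file in a single step, whereas a move across filesystems degrades to copy-and-delete.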
For commands that talk to the network, add the tool's own timeout as well: curl --max-time, ssh -o ConnectTimeout, a net_read_timeout on the dump. Those fire first and fail cleanly, with the tool's own error message. timeout is the outer hard stop for the night the inner one doesn't fire.
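A sketch of the layering for curl and ssh. The function names, durations, and arguments are all placeholders chosen for illustration; the point is only that the inner limits are shorter than the outer one.

```bash
# Hypothetical helpers; every duration here is a placeholder.
fetch_bounded() {
    # curl's own limits fire first: 10s to connect, 30s for the whole
    # transfer. The outer timeout (60s, SIGKILL 5s later) is the backstop.
    timeout -k 5s 60s \
        curl --fail --silent --show-error \
             --connect-timeout 10 --max-time 30 "$1"
}

ssh_bounded() {
    # ConnectTimeout bounds only connection setup. The ServerAlive options
    # drop a session whose peer stops answering. The outer timeout bounds
    # total runtime regardless of what ssh itself notices.
    local host="$1"; shift
    timeout -k 5s 120s \
        ssh -o BatchMode=yes -o ConnectTimeout=10 \
            -o ServerAliveInterval=15 -o ServerAliveCountMax=2 \
            "$host" "$@"
}
```

Note that ssh's ConnectTimeout on its own says nothing about a session that connects fine and then stalls, which is why the outer timeout still earns its place.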
A timeout is what makes a lock safe. Locking a job to a single instance with flock stops overlap, but a hang inside the locked job holds that lock forever — which is precisely how my nine-day gap happened. Bound the runtime and the lock always gets released, on time, every time. The two together, plus retrying transient failures with backoff, are the core of an unattended job that fails loudly instead of disappearing. The Hardened Cron Wrapper Generator composes all three, and Bash Scripts That Survive Cron walks the whole decision.
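Combining the lock and the deadline could be sketched like this, assuming flock from util-linux and GNU timeout. The lock path, the helper name run_locked_bounded, and the conflict exit code 75 are choices made for this example, not anything the tools require.

```bash
LOCK_FILE="${LOCK_FILE:-/tmp/nightly-backup.lock}"   # placeholder path

# run_locked_bounded is a hypothetical helper name.
# Exit codes: 75      = another run holds the lock (picked here via -E)
#             124/137 = the command hit the deadline
#             other   = the command's own status
run_locked_bounded() {
    flock -n -E 75 "$LOCK_FILE" \
        timeout -k "${KILL_GRACE:-20s}" "${MAX_RUNTIME:-5m}" "$@"
}
```

Because timeout runs inside flock, the lock is held only as long as the bounded command lives; once the deadline kills it, flock exits and the next night's run can take the lock. Giving the "lock busy" case its own exit code also means a skipped run can be logged instead of passing silently.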
If you want to reproduce the hang safely, a $4 droplet, a deliberately slow sleep 999, and a timeout 5s is the cheapest way to watch all three exit codes for yourself. More of the library is at bashsnippets.xyz, including killing a runaway process by name when a hang escapes the timeout entirely.
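The reproduction described above can be sketched as follows, assuming GNU timeout. A plain sleep dies on SIGTERM, so the second case uses a shell that ignores SIGTERM to force the escalation; that trick and the variable names are additions for this example.

```bash
# 1. Still running at the deadline, exits on the polite SIGTERM.
rc_term=0
timeout 5s sleep 999 || rc_term=$?
echo "hang, polite stop:  $rc_term"    # 124

# 2. Ignores SIGTERM (the child sleep inherits that), so -k escalates
#    to SIGKILL one second later.
rc_kill=0
timeout -k 1s 1s bash -c 'trap "" TERM; sleep 999' || rc_kill=$?
echo "hang, force-killed: $rc_kill"    # 137

# 3. Fails on its own, well inside the deadline.
rc_fail=0
timeout 5s false || rc_fail=$?
echo "own failure:        $rc_fail"    # 1
```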