Skip to content

Retry a Command with Exponential Backoff in Bash

retrybackoffresiliencedevopsscripting
4 min read

Quick Answer

A deploy step that fails once and dies is the difference between a script that works in a demo and one that survives production. Transient failures are everywhere — a DNS blip, a 503 from a recovering service, a fresh container's database that isn't accepting connections yet — and the fix isn't to die on the first error, it's to retry with exponential backoff. A retry loop runs the command, and on failure waits a delay that doubles each round (1s, 2s, 4s, 8s) up to a cap, with a little jitter so parallel callers don't retry on the same beat and re-overwhelm the thing that's recovering. Cap the attempts so a genuinely dead dependency fails fast instead of looping forever. Retry transient failures only — a 404 or a syntax error fails identically on every attempt, so retrying it just wastes time. The most common use is waiting for a port: retry nc -z db 5432 until the database is ready, then run the migration.

I once spent forty minutes at eleven at night debugging a deploy that wasn't broken. The release script ran the database migration, the migration threw connection refused, the script exited non-zero, the deploy rolled itself back, and I got paged. I read the migration. I read the logs. I checked the database — it was up, it was healthy, it accepted my connection instantly. I re-ran the deploy and it worked. So I did what you do at eleven at night and chalked it up to gremlins, until it happened again two days later and I finally watched the timing: the script brought up a fresh database container and started the migration about six seconds before Postgres finished initializing and began accepting connections. The migration was racing the database's boot, and most of the time it won, and the times it lost I lost forty minutes.

The script wasn't wrong about anything except its assumption that a dependency is ready the instant you ask for it. In production, dependencies are eventually ready. Networks blip. A service you call returns a 503 for the two seconds it takes to finish a rolling restart. An API rate-limits you with a 429 it fully expects you to retry. Treating the first failure as fatal turns every one of these normal, transient conditions into a paged engineer.

The fix is to retry — but retrying naively is its own trap. Retry instantly and you hammer a recovering service into staying down. Retry forever and a genuinely dead dependency hangs your script indefinitely. Retry a 404 and you wait a minute to confirm what you already knew. Good retries are bounded, backed off, and selective.

A retry function you can reuse

bash
#!/bin/bash # Script: retry.sh # Purpose: Survive transient failures (a flaky network, a service still booting) # instead of dying on the first error # Usage: retry.sh (or source the retry() function into your own scripts) set -euo pipefail CHECK="✓" CROSS="✗" # Retry a command with exponential backoff and jitter. # Usage: retry <max_attempts> <command> [args...] retry() { local max_attempts="$1"; shift local attempt=1 local delay=1 # base delay in seconds — doubles each round local max_delay=30 # cap so the backoff never runs away until "$@"; do if (( attempt >= max_attempts )); then echo "$CROSS '$*' failed after $attempt attempts" >&2 return 1 fi # 0–2s of jitter so parallel callers don't all retry on the same beat local pause=$(( delay + RANDOM % 3 )) echo "$CROSS attempt $attempt failed — retrying in ${pause}s" >&2 sleep "$pause" attempt=$(( attempt + 1 )) delay=$(( delay * 2 )) if (( delay > max_delay )); then delay=$max_delay; fi done echo "$CHECK '$*' succeeded on attempt $attempt" } # The fix for my 11pm deploy: wait for Postgres to accept connections first. retry 6 nc -z -w 2 db.internal 5432 echo "$CHECK database reachable — running migration" # A release artifact that occasionally 503s while the CDN warms up. retry 5 curl -fsS --max-time 10 -o /tmp/pkg.tar.gz https://releases.example.com/pkg.tar.gz

The whole engine is until "$@"; do ... done. until runs the command and executes the loop body only when it fails, exiting the instant it succeeds. Passing the command as "$@" (after shift-ing past the attempt count) means the function retries anything — a curl, an ssh, a port check, your own script — without the function caring what it is.

The backoff is the three lines at the bottom of the loop: sleep for the current delay plus a touch of jitter, then double the delay, capped at max_delay. That gives you 1s, 2s, 4s, 8s, 16s, 30s, 30s… The jitter — RANDOM % 3 — looks trivial, but it's what stops a fleet of machines that all failed at the same second from retrying at the same second and knocking the recovering service straight back over.

The part that's easy to get wrong

bash
# Good: a transient failure that retrying can fix retry 6 nc -z -w 2 db.internal 5432 # Bad: retrying a deterministic failure just delays the error 30 seconds retry 6 curl -fsS https://api.example.com/v1/thing-that-returns-404

A retry loop is only as smart as what you point it at. The database-readiness check belongs in a loop because the answer changes — it's "no" until the database boots, then "yes." A request that returns 404 returns 404 on attempt one and attempt six; the loop just postpones the failure and buries the real status under retry logging. When you can, branch on the exit code or HTTP status and retry only the transient ones.

For plain curl, its built-in --retry 5 --retry-delay 2 does most of this and is simpler — reach for the function when the thing you're retrying isn't curl, or when you want one backoff policy across a database probe, an ssh call, and a download all at once.

That eleven-o'clock deploy never paged me again once the migration waited for the port instead of assuming it. The interesting part is what didn't change: the database still took its six seconds to boot, the network still blipped occasionally. Retrying didn't make the dependencies faster — it stopped a normal, transient slowness from being treated as a fatal error.

Retries are the third leg of an unattended job that survives. Locking with flock stops a slow run from stacking; timeout stops a hung run from jamming the lock; retries stop a transient blip from killing the run outright. The Hardened Cron Wrapper Generator wires the retry loop into a full wrapper alongside the other two, and Bash Scripts That Survive Cron is the end-to-end version.

Run this script on a real Linux server

Get $200 free credit — DigitalOcean

Get $200 Free →

Affiliate link · we earn a commission

Want to test the wait-for-port pattern for real? Bring up a droplet, start a service on a delay, and watch retry poll the port until it opens. The rest of the library is at bashsnippets.xyz — the website up-check and restart a service if it stopped pair naturally with this one.

BashSnippets logo

Written by Anguishe

Creator of BashSnippets.xyz

bashsnippets.xyz/about

Run this script on a real Linux server

Get $200 free credit — DigitalOcean

Get $200 Free →

Affiliate link · we earn a commission

Need a domain for your next project?

Register with Namecheap — free WHOIS privacy included

Check Domain Prices →

Affiliate link · we earn a commission

PAID RESOURCE — $9

The Production Bash Toolkit

6 scripts + shared library + 52-page field guide. The production layer the free snippets don't cover.

Get the Toolkit →

Related Snippets

Frequently Asked Questions

faq — snippet

How do I retry a command until it succeeds in bash?

Wrap it in an until loop: `until mycommand; do ...; done` runs the body each time the command fails and stops the moment it succeeds. The danger is an unbounded loop against a permanently-broken dependency, so always add a maximum attempt count and return failure when you hit it. The reusable retry() function on this page takes the max attempts as its first argument and the command as the rest, so you write `retry 5 curl ...` and never reimplement the loop.

faq — snippet

What is exponential backoff and why add jitter?

Exponential backoff doubles the wait after each failure — 1s, 2s, 4s, 8s — instead of retrying instantly. Instant retries hammer a service that's already struggling and keep it down; backoff gives it room to recover. Jitter adds a small random offset to each wait. Without it, many clients that failed at the same instant retry at the same instant, producing a synchronized thundering herd that re-overwhelms the service on every round. A few hundred milliseconds to a couple of seconds of randomness spreads the retries out.

faq — snippet

How do I wait for a database or port to be ready before continuing?

Retry a connectivity check until it passes: `retry 10 nc -z -w 2 db.internal 5432` polls the Postgres port, backing off between tries, and returns success the moment the port accepts a connection. This is the wait-for-it pattern, and it's the cure for the deploy script that runs migrations a half-second before the database container finishes booting. Use nc -z (zero-I/O port scan) or bash's own /dev/tcp redirection; both tell you only whether the port is open, which is what you want here.

faq — snippet

Should I retry every kind of failure?

No, and this is the mistake that makes retries dangerous. Retry transient failures — timeouts, connection-refused, HTTP 429 and 5xx, DNS hiccups — the ones that might succeed if you try again. Do not retry deterministic failures: a 404, a 401, a syntax error, a missing file. They fail identically every time, so retrying them only delays the inevitable while hiding the real error behind a wall of retry noise. When you can, inspect the exit code or HTTP status and only loop on the codes that are worth looping on.

faq — snippet

curl already has --retry — when do I need my own loop?

Use curl --retry for plain curl calls; it handles transient HTTP and connection errors with its own backoff and is simpler than rolling your own. You need the generic retry() function when the thing you're retrying isn't curl — a database connection check, an ssh command, a flaky CLI tool, a custom health probe — or when you want one consistent backoff policy across all of them, or a condition curl can't express. The function wraps any command; curl --retry only wraps curl.