advanced files cron-ready

Bash Find Duplicate Files Script

By Anguishe · Tested on Ubuntu 22.04/24.04 LTS, Debian 12, RHEL 9

Quick answer: Hash every file with md5sum, sort by hash, then awk prints paths where the same hash appears twice. Tested on: Ubuntu 22.04/24.04 LTS · Debian 12 · RHEL 9

A full disk does not always come from new data — often it is the same ISO, screenshot, or backup copied three times under different names. Without a bash find duplicate files workflow, you delete by guesswork and risk removing the only good copy. This script hashes file content, groups byte-for-byte matches, and lists every duplicate path so you reclaim space on purpose.

Anguishe

Creator, BashSnippets.xyz · Self-taught Linux · Florida
Last updated: May 2026

The Script

Save as find-dupes.sh. Change TARGET to the directory you want to scan. Output is read-only — it never deletes files.

find-dupes.sh

#!/bin/bash
# Finds files with identical content using MD5 hashes.
# Run on Downloads folder to surface duplicates.

set -euo pipefail  # WHY: stop on errors, unset variables, or pipe failures

CHECK="✓"   # WHY: consistent success marker in script output
CROSS="✗"   # WHY: consistent failure marker in script output

TARGET="$HOME/Downloads"  # WHY: default scan path — change to your folder

if [[ ! -d "$TARGET" ]]; then
  echo "${CROSS} Directory not found: $TARGET" >&2
  exit 1
fi

echo "Scanning for duplicates in $TARGET..."
find "$TARGET" -type f \
  | xargs -r md5sum 2>/dev/null \
  | sort \
  | awk 'seen[$1]++'

echo "${CHECK} Scan complete — lines above are duplicate copies (review before deleting)"

✓ What this does, line by line

find "$TARGET" -type f lists every regular file under the folder. xargs md5sum computes a checksum per file. sort groups matching hashes together. awk 'seen[$1]++' prints only lines where that hash was already seen — the duplicate copies, not the first file in each group.

How to Use

Create the script — nano find-dupes.sh, paste the code, save with Ctrl+X → Y → Enter.
Set TARGET — point at $HOME/Downloads, /var/www/uploads, or any directory you own.
Make it executable — chmod +x find-dupes.sh
Run once manually — ./find-dupes.sh and read the hash + path pairs printed to the terminal.
Delete with care — keep one file per hash group; remove extras only after you confirm they are true clones.

Schedule with cron

Weekly scans catch duplicate uploads before they dominate disk usage on a VPS or NAS:

crontab

# Every Sunday at 3:00 AM — log results for review Monday morning
0 3 * * 0 /home/user/find-dupes.sh >> /var/log/find-dupes.log 2>&1

💡 Use the full path in cron

Cron does not load your shell profile. Always use an absolute path like /home/user/find-dupes.sh, not ./find-dupes.sh.

Variations

Pass the scan directory as an argument

Override TARGET without editing the file:

find-dupes-arg.sh

TARGET="${1:-$HOME/Downloads}"  # WHY: $1 when provided, else default Downloads

Run: ./find-dupes.sh /mnt/media

Write results to a dated report file

Useful when cron runs overnight and you review duplicates the next day:

terminal

./find-dupes.sh | tee "dupes_$(date +%Y%m%d).log"

Limit scan depth (large trees)

On deep directory trees, cap depth so the first pass finishes faster:

find-dupes.sh (depth)

find "$TARGET" -maxdepth 3 -type f \
  | xargs -r md5sum 2>/dev/null \
  | sort \
  | awk 'seen[$1]++'

Common Mistakes

⚠ Deleting the first path in each group

The awk filter prints the second and later files sharing a hash, not the original. Before rm, decide which copy you want to keep — often the shortest path or the newest dated folder.

⚠ Assuming same filename means duplicate content

report.pdf in two folders might differ inside. This script compares bytes via MD5, not names. Two different files named backup.zip only appear if their hashes match.

⚠ Running md5sum on millions of tiny files over NFS

Network mounts amplify I/O latency. Scan a local copy or run the job on the file server directly. For production volumes, consider fdupes -r after validating this bash pipeline on a sample folder.

Understanding the Commands

Command reference for adapting this bash find duplicate files pipeline:

Command	What it does
find "$TARGET" -type f	Lists every regular file under TARGET recursively
xargs md5sum	Runs md5sum on each path; outputs `hash path` per line
xargs -r	GNU extension: skip md5sum when find returns zero files (avoids exit 123 with set -e)
sort	Sorts lines alphabetically so identical hashes appear on consecutive rows
awk 'seen[$1]++'	Increments a counter per hash; prints the line when count > 0 (duplicate)
md5sum	Computes a 128-bit checksum of file contents — same hash means identical bytes

Frequently Asked Questions

How do I find duplicate files in Linux with bash?

Hash every file with md5sum, sort the output by hash, then use awk 'seen[$1]++' to print paths where the same hash appears more than once. The full pipeline is: find DIR -type f | xargs md5sum | sort | awk 'seen[$1]++'

Is MD5 reliable for finding duplicate files?

For everyday cleanup, yes — identical MD5 hashes mean the files are byte-for-byte copies. MD5 is not meant for cryptographic proof, but it is fast and accurate enough for photos, installers, and duplicate archives on a workstation or VPS.

How do I safely delete files after finding duplicates?

Never pipe this script into rm. Review the list, keep one file per hash group, and delete extras manually. For automated cleanup, use find ... -print first, confirm output, then add -delete only when you are certain.

Why does xargs md5sum fail on some systems?

When find returns no files, GNU xargs exits with code 123, which triggers set -e. Use xargs -r so md5sum is skipped on an empty list. RHEL and Ubuntu both ship GNU xargs by default.

How long does a duplicate scan take?

Time scales with total bytes read, not just file count. A 50 GB media folder may take several minutes because every byte is hashed. Run large scans overnight via cron, or use -maxdepth for a quicker sample pass.

Can I find duplicates by filename instead of content?

Yes, but that only finds name collisions, not identical content. For content matches, use this MD5 pipeline. For duplicate names only: find DIR -type f -printf '%f\n' | sort | uniq -d

Hash terabytes of uploads on a Droplet with room to breathe Running duplicate scans on a laptop ties up your desk machine. A DigitalOcean Droplet with a block storage volume gives you a dedicated place to scan /var/www or backup trees — $4/mo with $200 credit for new accounts.

Get $200 Free →

Free Tool

Wrap this scan in a production-ready template

Use the Bash Boilerplate Generator to add logging, lock files, and argument parsing around this duplicate finder.

Try the Generator →