Bash Find Duplicate Files Script
md5sum, sort by hash, then awk prints paths where the same hash appears twice.
Tested on: Ubuntu 22.04/24.04 LTS · Debian 12 · RHEL 9
A full disk does not always come from new data — often it is the same ISO, screenshot, or backup copied three times under different names. Without a bash find duplicate files workflow, you delete by guesswork and risk removing the only good copy. This script hashes file content, groups byte-for-byte matches, and lists every duplicate path so you reclaim space on purpose.
The Script
Save as find-dupes.sh. Change TARGET to the directory you want to scan. Output is read-only — it never deletes files.
#!/bin/bash # Finds files with identical content using MD5 hashes. # Run on Downloads folder to surface duplicates. set -euo pipefail # WHY: stop on errors, unset variables, or pipe failures CHECK="✓" # WHY: consistent success marker in script output CROSS="✗" # WHY: consistent failure marker in script output TARGET="$HOME/Downloads" # WHY: default scan path — change to your folder if [[ ! -d "$TARGET" ]]; then echo "${CROSS} Directory not found: $TARGET" >&2 exit 1 fi echo "Scanning for duplicates in $TARGET..." find "$TARGET" -type f \ | xargs -r md5sum 2>/dev/null \ | sort \ | awk 'seen[$1]++' echo "${CHECK} Scan complete — lines above are duplicate copies (review before deleting)"
find "$TARGET" -type f lists every regular file under the folder. xargs md5sum computes a checksum per file. sort groups matching hashes together. awk 'seen[$1]++' prints only lines where that hash was already seen — the duplicate copies, not the first file in each group.
How to Use
- Create the script —
nano find-dupes.sh, paste the code, save with Ctrl+X → Y → Enter. - Set
TARGET— point at$HOME/Downloads,/var/www/uploads, or any directory you own. - Make it executable —
chmod +x find-dupes.sh - Run once manually —
./find-dupes.shand read the hash + path pairs printed to the terminal. - Delete with care — keep one file per hash group; remove extras only after you confirm they are true clones.
Schedule with cron
Weekly scans catch duplicate uploads before they dominate disk usage on a VPS or NAS:
# Every Sunday at 3:00 AM — log results for review Monday morning 0 3 * * 0 /home/user/find-dupes.sh >> /var/log/find-dupes.log 2>&1
Cron does not load your shell profile. Always use an absolute path like /home/user/find-dupes.sh, not ./find-dupes.sh.
Variations
Pass the scan directory as an argument
Override TARGET without editing the file:
TARGET="${1:-$HOME/Downloads}" # WHY: $1 when provided, else default Downloads
Run: ./find-dupes.sh /mnt/media
Write results to a dated report file
Useful when cron runs overnight and you review duplicates the next day:
./find-dupes.sh | tee "dupes_$(date +%Y%m%d).log"
Limit scan depth (large trees)
On deep directory trees, cap depth so the first pass finishes faster:
find "$TARGET" -maxdepth 3 -type f \ | xargs -r md5sum 2>/dev/null \ | sort \ | awk 'seen[$1]++'
Common Mistakes
The awk filter prints the second and later files sharing a hash, not the original. Before rm, decide which copy you want to keep — often the shortest path or the newest dated folder.
report.pdf in two folders might differ inside. This script compares bytes via MD5, not names. Two different files named backup.zip only appear if their hashes match.
Network mounts amplify I/O latency. Scan a local copy or run the job on the file server directly. For production volumes, consider fdupes -r after validating this bash pipeline on a sample folder.
Understanding the Commands
Command reference for adapting this bash find duplicate files pipeline:
| Command | What it does |
|---|---|
| find "$TARGET" -type f | Lists every regular file under TARGET recursively |
| xargs md5sum | Runs md5sum on each path; outputs hash path per line |
| xargs -r | GNU extension: skip md5sum when find returns zero files (avoids exit 123 with set -e) |
| sort | Sorts lines alphabetically so identical hashes appear on consecutive rows |
| awk 'seen[$1]++' | Increments a counter per hash; prints the line when count > 0 (duplicate) |
| md5sum | Computes a 128-bit checksum of file contents — same hash means identical bytes |
Frequently Asked Questions
How do I find duplicate files in Linux with bash?
Hash every file with md5sum, sort the output by hash, then use awk 'seen[$1]++' to print paths where the same hash appears more than once. The full pipeline is: find DIR -type f | xargs md5sum | sort | awk 'seen[$1]++'
Is MD5 reliable for finding duplicate files?
For everyday cleanup, yes — identical MD5 hashes mean the files are byte-for-byte copies. MD5 is not meant for cryptographic proof, but it is fast and accurate enough for photos, installers, and duplicate archives on a workstation or VPS.
How do I safely delete files after finding duplicates?
Never pipe this script into rm. Review the list, keep one file per hash group, and delete extras manually. For automated cleanup, use find ... -print first, confirm output, then add -delete only when you are certain.
Why does xargs md5sum fail on some systems?
When find returns no files, GNU xargs exits with code 123, which triggers set -e. Use xargs -r so md5sum is skipped on an empty list. RHEL and Ubuntu both ship GNU xargs by default.
How long does a duplicate scan take?
Time scales with total bytes read, not just file count. A 50 GB media folder may take several minutes because every byte is hashed. Run large scans overnight via cron, or use -maxdepth for a quicker sample pass.
Can I find duplicates by filename instead of content?
Yes, but that only finds name collisions, not identical content. For content matches, use this MD5 pipeline. For duplicate names only: find DIR -type f -printf '%f\n' | sort | uniq -d
/var/www or backup trees — $4/mo with $200 credit for new accounts.