This one-liner is taken from:
http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash
and is explained at:
http://heyrod.com/snippet/t/linux.html
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
The explanation is as follows:
1 $ find -not -empty -type f -printf "%s\n" | \
2 > sort -rn | \
3 > uniq -d | \
4 > xargs -I{} -n1 find -type f -size {}c -print0 | \
5 > xargs -0 md5sum | \
6 > sort | \
7 > uniq -w32 --all-repeated=separate | \
8 > cut -d" " -f3-
You will probably want to redirect the output to a file, since the command can take a long time to run.
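For example, to scan a specific directory tree and save the result, something like the following should work (a minimal sketch; /path/to/scan and duplicates.txt are placeholder names, and GNU find defaults to the current directory when no starting path is given):

cd /path/to/scan
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -d" " -f3- > duplicates.txt

The resulting file then contains one group of identical files per block, with a blank line between groups.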
If I understand this correctly (a small worked example follows after these notes):
Line 1 enumerates the non-empty regular files, printing the size of each one.
Line 2 sorts the sizes numerically, in descending order.
Line 3 keeps only the sizes that occur more than once, since a file can only be a duplicate if another file has the same size.
For each remaining size, line 4 finds all the files of that size.
Line 5 computes the MD5 hash for all the files found in line 4, outputting the MD5 hash and file name. (This is repeated for each set of files of a given size.)
Line 6 sorts that list for easy comparison.
Line 7 compares the first 32 characters of each line (the MD5 hash) and keeps only the lines whose hash appears more than once, separating each group of duplicates with a blank line.
Line 8 cuts away the hash, leaving only the path and file name of each matching line.
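To make the intermediate stages concrete, here is a small sketch of what each part of the pipeline might print for a directory containing two identical 6-byte files a.txt and b.txt plus an unrelated 4-byte file c.txt; the file names, sizes and hash values are invented for illustration:

$ find -not -empty -type f -printf "%s\n"
6
4
6
$ find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d
6
$ find -type f -size 6c -print0 | xargs -0 md5sum | sort
0123456789abcdef0123456789abcdef  ./a.txt
0123456789abcdef0123456789abcdef  ./b.txt
$ find -type f -size 6c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate | cut -d" " -f3-
./a.txt
./b.txt

Only the last two paths are what the full one-liner reports: each group of identical files, separated from the next group by a blank line.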