Version 1 - History - Bash find duplicate files - CentOS 6 - OMB Redmine

Jeremias Keihsler, 13.01.2017 10:43

 This one-liner is taken from:
 http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash
 and is explained at:
 http://heyrod.com/snippet/t/linux.html
 <pre><code class="bash">
 find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d  | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
 </code></pre>
 The explanation is as follows (this version adds a final cut step that keeps only the file names):
 <pre>
 $ find -not -empty -type f -printf "%s\n" | \
 > sort -rn | \
 > uniq -d | \
 > xargs -I{} -n1 find -type f -size {}c -print0 | \
 > xargs -0 md5sum | \
 > sort | \
 > uniq -w32 --all-repeated=separate | \
 > cut -d" " -f3-
 </pre>
 You will probably want to redirect the output to a file, as the command runs slowly.
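A minimal sketch of saving the result to a file, assuming GNU find, xargs and uniq. The helper name dupes_to_file and the paths in the example call are hypothetical and not part of the original one-liner; the -r option is added here so that md5sum is not started when no candidate files are found.

```bash
#!/usr/bin/env bash
# dupes_to_file DIR REPORT
# Hypothetical helper: runs the one-liner against DIR and writes the
# grouped md5sum lines to REPORT instead of the terminal.
dupes_to_file() {
  local dir="$1" report="$2"
  find "$dir" -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
    xargs -I{} find "$dir" -type f -size {}c -print0 |
    xargs -0 -r md5sum | sort |
    uniq -w32 --all-repeated=separate > "$report"
}

# Example call (both paths are placeholders):
#   dupes_to_file /srv/data /tmp/duplicates.txt
```

Keeping the report outside the directory being searched avoids the report file showing up in its own scan.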
 If I understand this correctly:
 Line 1 prints the size in bytes of every non-empty regular file.
 Line 2 sorts the sizes numerically, in descending order.
 Line 3 keeps one copy of each size that occurs more than once and drops the sizes that appear only once.
 For each remaining size, line 4 finds all the files of that size.
 Line 5 computes the MD5 hash of every file found in line 4, outputting the hash followed by the file name. (Line 4 runs once per size; xargs collects the resulting file names and passes them to md5sum in batches.)
 Line 6 sorts that list for easy comparison.
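 Line 7 compares the first 32 characters of each line (the MD5 hash) and prints only the lines whose hash occurs more than once, with a blank line between groups of duplicates.
 Line 8 prints only the path and file name of each matching line, dropping the hash.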
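The eight steps above can be tried on a small scratch directory. This is only a sketch under a few assumptions: GNU find, xargs and uniq are available, the function name find_dupes and the demo file names are made up for this example, and -r is added so md5sum is skipped when nothing matches. The file c.txt has the same size as a.txt and b.txt but different content, so it passes the size filter in line 3 and is only rejected by the hash comparison in line 7.

```bash
#!/usr/bin/env bash
set -euo pipefail

# find_dupes DIR: hypothetical wrapper around the commented pipeline.
find_dupes() {
  local dir="${1:-.}"
  find "$dir" -not -empty -type f -printf "%s\n" |      # line 1: sizes of non-empty regular files
    sort -rn |                                          # line 2: numeric sort, largest first
    uniq -d |                                           # line 3: keep sizes seen more than once
    xargs -I{} find "$dir" -type f -size {}c -print0 |  # line 4: files of exactly that size
    xargs -0 -r md5sum |                                # line 5: hash each candidate file
    sort |                                              # line 6: equal hashes become adjacent
    uniq -w32 --all-repeated=separate |                 # line 7: keep repeated hashes, grouped
    cut -d" " -f3-                                      # line 8: drop the hash, keep the path
}

# Scratch data: a.txt and b.txt are identical, c.txt only shares their size.
demo=$(mktemp -d)
printf 'same content\n' > "$demo/a.txt"
printf 'same content\n' > "$demo/b.txt"
printf 'diff content\n' > "$demo/c.txt"
printf 'unique\n'       > "$demo/d.txt"
: > "$demo/empty.txt"   # empty file, skipped by -not -empty

result=$(find_dupes "$demo")
printf '%s\n' "$result"   # lists a.txt and b.txt only
rm -rf "$demo"
```

Comparing sizes first keeps the expensive hashing limited to files that could possibly be duplicates, which is why d.txt and empty.txt are never passed to md5sum.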