Bash find duplicate files » History » Version 1

Jeremias Keihsler, 13.01.2017 10:43

This one-liner is taken from:
http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash

and was explained at:
http://heyrod.com/snippet/t/linux.html

<pre><code class="bash">
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
</code></pre>

The explanation is as follows:
<pre>
1 $ find -not -empty -type f -printf "%s\n" | \
2 > sort -rn | \
3 > uniq -d | \
4 > xargs -I{} -n1 find -type f -size {}c -print0 | \
5 > xargs -0 md5sum | \
6 > sort | \
7 > uniq -w32 --all-repeated=separate | \
8 > cut -d" " -f3-
</pre>

You probably want to redirect the output to a file, as the command runs slowly on large directory trees.
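
One way to make redirecting the output convenient is to wrap the pipeline in a shell function. The following is only a minimal sketch: the function name @find_dupes@ and the example paths are hypothetical, and it assumes the GNU versions of find, xargs, md5sum and uniq. It drops @-n1@ because @-I{}@ already runs one find per input line, and it adds @-r@ so that md5sum is not started when there are no candidates.

```bash
# Sketch only: find_dupes is a hypothetical helper name, not part of the original one-liner.
# Assumes GNU find, xargs, md5sum and uniq (-printf, -r, -w and --all-repeated are GNU options).
find_dupes() {
    # $1: directory to scan (defaults to the current directory)
    find "${1:-.}" -not -empty -type f -printf '%s\n' |
        sort -rn |
        uniq -d |
        xargs -I{} find "${1:-.}" -type f -size {}c -print0 |
        xargs -0 -r md5sum |
        sort |
        uniq -w32 --all-repeated=separate
}

# Example call (both paths are placeholders):
# find_dupes ~/photos > ~/duplicates.txt
```

Writing the report to a location outside the scanned directory keeps the report file itself out of the results.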

If I understand the pipeline correctly:

Line 1 lists the size in bytes of every non-empty regular file.
Line 2 sorts the sizes numerically, in descending order.
Line 3 keeps only the sizes that appear more than once, printing each of them a single time.
For each remaining size, line 4 finds all the files of exactly that size.
Line 5 computes the MD5 hash of every file found in line 4, outputting the MD5 hash and the file name. (The file names for all sizes arrive as one NUL-separated stream, which xargs hands to md5sum in batches.)
Line 6 sorts that list so that identical hashes end up on adjacent lines.
Line 7 compares the first 32 characters of each line (the MD5 hash) and prints only the lines whose hash occurs more than once, with a blank line between groups.
Line 8 strips the hash and prints only the path and file name of the matching lines.
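
Lines 6 to 8 can be tried in isolation on a few hand-written lines in md5sum format. This is only an illustrative sketch: the helper name @sample_hashes@ and the file names are made up, the hashes are placeholders that were not computed from real files, and GNU uniq is assumed.

```bash
# Made-up md5sum-style lines: 32-character hash, two spaces, then the path.
# The hashes are placeholders and were not computed from these (hypothetical) files.
sample_hashes() {
    printf '%s\n' \
        'b1946ac92492d2347c6235b4d2611184  ./one.txt' \
        '0cc175b9c0f1b6a831c399e269772661  ./two.txt' \
        'b1946ac92492d2347c6235b4d2611184  ./copy of one.txt'
}

# Lines 6 to 8 of the pipeline applied to the sample:
# sort groups equal hashes, uniq keeps the repeated ones, cut drops the hash.
dupes=$(sample_hashes | sort | uniq -w32 --all-repeated=separate | cut -d' ' -f3-)
printf '%s\n' "$dupes"
# prints:
# ./copy of one.txt
# ./one.txt
```

The @-f3-@ in line 8 is needed because md5sum separates hash and file name with two characters. Splitting on a single space therefore leaves field 2 empty, and the path starts at field 3. Taking everything from field 3 onwards also keeps paths that contain spaces intact.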