Bash find duplicate files » Historie » Version 1
Jeremias Keihsler, 13.01.2017 10:43
This one-liner is taken from:

http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash

and had been explained at:

http://heyrod.com/snippet/t/linux.html
<pre><code class="bash">
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d | xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
</code></pre>
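To see the one-liner in action, here is a minimal, self-contained demo (the throwaway sandbox directory and the sample file names are purely illustrative; GNU findutils and coreutils are assumed):

```shell
# run in a throwaway directory so no real files are touched
cd "$(mktemp -d)"
printf 'hello\n' > a.txt              # 6 bytes
printf 'hello\n' > b.txt              # exact duplicate of a.txt
printf 'something else\n' > c.txt     # unique size and content

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d \
  | xargs -I{} -n1 find -type f -size {}c -print0 \
  | xargs -0 md5sum | sort | uniq -w32 --all-repeated=separate
# lists ./a.txt and ./b.txt with the same hash; c.txt does not appear
```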
The explanation is as follows:

<pre>
1 $ find -not -empty -type f -printf "%s\n" | \
2 > sort -rn | \
3 > uniq -d | \
4 > xargs -I{} -n1 find -type f -size {}c -print0 | \
5 > xargs -0 md5sum | \
6 > sort | \
7 > uniq -w32 --all-repeated=separate | \
8 > cut -d" " -f3-
</pre>
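The size-filtering stages (lines 2 and 3) can be tried in isolation. With some illustrative sizes, `sort -rn` orders them numerically in descending order, and `uniq -d` keeps only the values that occur more than once:

```shell
# only repeated sizes survive uniq -d after the numeric sort
printf '%s\n' 6 14 6 3 | sort -rn | uniq -d
# → 6
```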
You probably want to redirect the output to a file, as the command runs slowly.

If I understand this correctly:
Line 1 enumerates the non-empty regular files, printing one file size per line.
Line 2 sorts the sizes numerically, in descending order.
Line 3 strips out the sizes that appear only once.
For each remaining size, line 4 finds all the files of that size.
Line 5 computes the MD5 hash of every file found in line 4, outputting the hash and the file name. (This is repeated for each set of files of a given size.)
Line 6 sorts that list so identical hashes end up next to each other.
Line 7 compares the first 32 characters of each line (the MD5 hash), keeping only the duplicates and separating each group with a blank line.
Line 8 prints just the file name and path of each matching line.
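Line 8 works because `md5sum` separates the hash from the file name with two characters (a space, plus a second space or `*` indicating text or binary mode). With a single-space delimiter, field 2 is therefore empty and the name starts at field 3. A small sketch (the hash and path here are illustrative):

```shell
# md5sum output: 32-char hash, two separator characters, file name
line='b1946ac92492d2347c6235b4d2611184  ./a.txt'
printf '%s\n' "$line" | cut -d" " -f3-
# → ./a.txt
```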