Maildir Deduplication by Message-ID

Introduction

This is a set of scripts for finding duplicate messages in one or more Maildir-style directories (one message per file). I wrote them to de-duplicate personal archives of mailing lists, which held multiple copies of the same message delivered via different transport paths (MX, protocol, mail sync time, etc.).

Assumptions and caveats

These tools do not have much in the way of child-proofing. If you ask them to do something stupid, they will blindly do it. It's quite easy to construct a pipeline that will destroy data. If you're not comfortable with that, please don't use these.

Running these scripts while something else is modifying the same messages is likely to lead to confusion, race conditions, and maybe lost data. Adding new messages is fine, and reading messages is fine; just don't move, change, or delete anything.

These scripts expect Maildir. If you manage to feed in an mbox-style mailbox, they will likely treat it as just one message.

For duplicate detection, we assume the Message-ID field is suitable for identifying duplicate messages. That is not a universally valid assumption. Be careful. ckmddupes can confirm the assumption.

The scripts

mddupes script

File: mddupes.pl

Grovel over a tree of message files, reading each message, finding the Message-ID from each, and reporting any duplicates encountered.

Theory of operation

Assumptions and caveats

Assumes every entry in each directory is a message file.

It makes no attempt to identify subdirectories and/or non-message files.

If it encounters a subdirectory it will try to open it as a file; this ends up doing nothing on my system, but for all I know it could make demons fly out of your nose. If it encounters a non-message file it will try to open it like a message. At best it will then complain it could not find a Message-ID. At worst it will interpret random data as a Message-ID and give bad results.

Feed it the names of the Maildir "new" and/or "cur" directories, while nothing else is using those directories, and it should be OK. The findmaildirs -d command is useful for this.
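
To make the idea concrete, here is a minimal Python sketch of the same approach. It is not the code in mddupes.pl, and the names message_id_of and find_duplicates are made up for illustration. Unlike the behaviour described above, this sketch skips anything that is not a regular file, and it silently ignores files with no Message-ID.

```python
import os
from collections import defaultdict
from email import policy
from email.parser import BytesHeaderParser


def message_id_of(path):
    """Return the Message-ID header of one message file, or None if absent."""
    with open(path, "rb") as handle:
        headers = BytesHeaderParser(policy=policy.compat32).parse(handle)
    value = headers.get("Message-ID")
    return str(value).strip() if value else None


def find_duplicates(directories):
    """Map each Message-ID seen more than once to the files that carry it."""
    by_id = defaultdict(list)
    for directory in directories:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue  # assumption: skip subdirectories instead of opening them
            message_id = message_id_of(path)
            if message_id is not None:
                by_id[message_id].append(path)
    return {mid: paths for mid, paths in by_id.items() if len(paths) > 1}
```

Calling find_duplicates with a list of "cur" and "new" directories returns a dictionary, which you could then print in whatever format the next stage expects.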

nmdupes script

File: nmdupes.py

Run a Notmuch search query, and print Message-IDs for any resulting messages that are contained in more than one file.

Commentary

If you have your mail indexed by Notmuch, you can use it to look for duplicate Message-IDs. Since Notmuch already has the Message-IDs and file names indexed, it can run much faster than mddupes (which has to open and read each message file). On my system, it was hours versus minutes.

Provide a Notmuch search query. To run against all mail, simply use * as the query (escaped for your shell, as appropriate).
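
For illustration, the same lookup can be sketched by driving the notmuch command-line tool from Python. This is not the code in nmdupes.py (which may well use the Notmuch bindings instead), and the helper names are hypothetical. It assumes notmuch is on your PATH, and that "notmuch search --format=json" prints a JSON array of bare Message-IDs for --output=messages and of file names for --output=files.

```python
import json
import subprocess


def notmuch_json(*args):
    """Run `notmuch search --format=json ...` and decode the JSON array it prints."""
    result = subprocess.run(
        ["notmuch", "search", "--format=json", *args],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def quote_id(message_id):
    """Build an id: search term, doubling any embedded double quotes."""
    return 'id:"' + message_id.replace('"', '""') + '"'


def files_for(message_id):
    """Ask Notmuch which files hold the message with this Message-ID."""
    return notmuch_json("--output=files", quote_id(message_id))


def duplicated_ids(message_ids, lookup=files_for):
    """Keep only the Message-IDs that Notmuch knows in more than one file."""
    duplicates = {}
    for message_id in message_ids:
        files = lookup(message_id)
        if len(files) > 1:
            duplicates[message_id] = files
    return duplicates
```

Something like duplicated_ids(notmuch_json("--output=messages", "*")) would then cover all indexed mail. Running one notmuch process per message is slow; the sketch favours clarity over speed.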

ckmddupes script

File: ckmddupes.pl

Read the output of mddupes / nmdupes, and check each pair of duplicates by reading both files in full and making sure they are identical in content (not just Message-ID).

Theory of operation

Assumptions and caveats

Minor differences in messages (in particular, footers added to some but not others) will still be reported as differing. Whether or not that's the right thing depends on your scenario.
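
As a rough sketch of this check in Python (not the code in ckmddupes.pl): the function below assumes each input line holds two file names separated by a tab, which is a guess at the format rather than something documented here. The name confirmed_pairs is hypothetical.

```python
import filecmp
import sys


def confirmed_pairs(lines):
    """Yield only the (first, second) pairs whose files are byte-for-byte identical."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # assumption: anything that is not a tab-separated pair is ignored
        first, second = fields
        # shallow=False forces a comparison of contents, not just size and mtime
        if filecmp.cmp(first, second, shallow=False):
            yield first, second
        else:
            print(f"differ: {first} {second}", file=sys.stderr)
```

Because the comparison is exact, a pair that differs only by an added footer ends up on stderr rather than in the confirmed output.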

linkify-pairs script

File: linkify-pairs.pl

Read the output of mddupes / nmdupes / ckmddupes, and turn each pair of distinct duplicate files into a pair of hardlinks to the same file (inode).

Theory of operation

Assumptions and caveats

Assumes the input is correct, i.e., that the files actually are duplicates. If you feed it a list of unrelated files, it will happily delete half of them.
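
One way to do the replacement safely is sketched below in Python. This is not the code in linkify-pairs.pl; the function name and the temporary-name suffix are made up for illustration, and both files are assumed to live on the same filesystem (hardlinks cannot cross filesystems).

```python
import os


def linkify(keep, replace):
    """Make `replace` a hardlink to `keep`; return False if they already share an inode."""
    if os.path.samefile(keep, replace):
        return False  # already the same file, nothing to do
    temp = replace + ".linkify-tmp"  # hypothetical temporary name
    os.link(keep, temp)        # new name for keep's inode
    os.replace(temp, replace)  # atomically swap it in; the old content of `replace` is gone
    return True
```

Linking to a temporary name first and then renaming means there is never a moment when the second name is missing. It does not make the operation any less destructive: whatever was in the second file is lost.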

findmaildirs script

File: findmaildirs.sh

Find directories that look like Maildir mail folders.

Theory of operation

Assumptions and caveats

Assumes any directory with cur, new, and tmp subdirectories is a Maildir.
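
That heuristic is simple enough to sketch in a few lines of Python. This is an illustration only, not a translation of findmaildirs.sh, and it does not attempt to model the -d option; find_maildirs is a hypothetical name.

```python
import os


def find_maildirs(root):
    """Return every directory under `root` that has cur, new, and tmp subdirectories."""
    found = []
    for dirpath, dirnames, _filenames in os.walk(root):
        if {"cur", "new", "tmp"} <= set(dirnames):
            found.append(dirpath)
    return sorted(found)
```

The walk keeps descending after a match, so nested folders (for example, Maildir++ style subfolders inside a Maildir) are reported as well.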

Usage examples

By themselves, none of the scripts does anything to fix a problem. They're intended to be used as building blocks, with pipelines and redirection.

mddupes $( findmaildirs /oldmail -d ) | ckmddupes | linkify-pairs

The above will look for duplicate messages in Maildirs under the /oldmail directory, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file.

nmdupes \* | linkify-pairs

The above will look for duplicate messages in all mail known to Notmuch, and turn any messages with duplicate Message-IDs into one hardlinked file. Whether or not the content of said messages is the same is not checked.

nmdupes \* | ckmddupes | linkify-pairs

The above will look for duplicate messages in all mail known to Notmuch, confirm the candidates are duplicates, and turn any duplicates into one hardlinked file. It will take much longer to run, since it has to read and compare every candidate pair.

nmdupes \* >nmdupes.out 2>nmdupes.err
ckmddupes <nmdupes.out >ckmddupes.out 2>ckmddupes.err
wc -l nmdupes.out ckmddupes.out
linkify-pairs <ckmddupes.out

The above does the same thing as the previous example, but breaks it up into stages, allowing opportunities for review. It uses wc(1) to count lines for some basic statistics.

mddupes $( findmaildirs -d ) | ckmddupes | cut -f2 | xargs rm

The above looks for duplicates in Maildirs under the current directory (and subdirectories), and permanently deletes duplicates, leaving just one copy of each pair.