Safe archive

A safe archive format is like the tar file format, but with checksumming and duplication built in, so that data corruption can be detected and repaired automatically.

When I was writing about archiving digital works to last a long time, I wrote that the filesystems with the best guarantees about data integrity (ZFS, Btrfs) are also the ones which have only one implementation, and would be difficult to re-implement if their current implementations went away.

I suggested that the solution to this would be to keep the work in a Git repository, because Git is very good at detecting data corruption. Unfortunately, Git is also very bad at fixing it, mainly because it stores objects in compressed form: the bytes that get corrupted on disk are the compressed ones, while the hash Git keeps covers the uncompressed object. There are ways around this, but they will always be very slow.
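To make the Git problem concrete, here is a minimal sketch of one conceivable workaround, not necessarily the one meant above. It assumes a loose blob object (zlib-compressed, identified by the SHA-1 of its uncompressed header and content) and assumes that exactly one bit has flipped. The function names are made up for this illustration. Every candidate repair costs a full decompress and re-hash, which is where the slowness comes from.

```python
import hashlib
import zlib
from typing import Optional


def git_object(content: bytes) -> tuple[str, bytes]:
    """Build a blob the way Git does: hash the raw object, store it compressed."""
    raw = b"blob %d\x00" % len(content) + content
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def is_intact(object_id: str, stored: bytes) -> bool:
    """The hash can only be checked after the stored bytes decompress."""
    try:
        raw = zlib.decompress(stored)
    except zlib.error:
        return False
    return hashlib.sha1(raw).hexdigest() == object_id


def repair_single_bit(object_id: str, stored: bytes) -> Optional[bytes]:
    """Brute force: flip each bit in turn, then decompress and re-hash."""
    for bit in range(len(stored) * 8):
        candidate = bytearray(stored)
        candidate[bit // 8] ^= 1 << (bit % 8)
        if is_intact(object_id, bytes(candidate)):
            return bytes(candidate)
    return None  # more than one bit is damaged, or the assumption was wrong


object_id, stored = git_object(b"hello world\n")
damaged = bytearray(stored)
damaged[len(damaged) // 2] ^= 0x10  # simulate a flipped bit on disk
repaired = repair_single_bit(object_id, bytes(damaged))
```

The search is linear in the object size for one flipped bit and quadratic for two, with a decompression inside every step, so it does not scale to large objects or heavier damage.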

The format should also be simple to implement, so that if the tools originally built for it ever stop working, the data inside can still be recovered with relative ease by re-implementing it.
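For a sense of scale, a record with checksumming and duplication can be written and read in a handful of lines. The layout below is purely hypothetical (an 8-byte length, a SHA-256 digest, the payload, and the whole record stored twice); it is a sketch of the idea, not a proposed specification.

```python
import hashlib
import struct

# Hypothetical record header: payload length, then SHA-256 of the payload.
HEADER = struct.Struct(">Q32s")


def write_record(payload: bytes, copies: int = 2) -> bytes:
    """Serialise one record and store it `copies` times back to back."""
    digest = hashlib.sha256(payload).digest()
    record = HEADER.pack(len(payload), digest) + payload
    return record * copies


def read_record(blob: bytes, copies: int = 2) -> bytes:
    """Return the payload from the first copy whose checksum still matches."""
    size = len(blob) // copies
    for index in range(copies):
        record = blob[index * size:(index + 1) * size]
        length, digest = HEADER.unpack_from(record)
        payload = record[HEADER.size:HEADER.size + length]
        if len(payload) == length and hashlib.sha256(payload).digest() == digest:
            return payload
    raise ValueError("all copies are corrupt")


blob = bytearray(write_record(b"hello"))
blob[41] ^= 0xFF  # damage the payload of the first copy
recovered = read_record(bytes(blob))
```

A reader for this layout needs nothing beyond a SHA-256 implementation, which is the kind of property that keeps re-implementation cheap.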

Techniques which can be used for ensuring data recoverability include:

SECDED codes
Merkle tree hashing of the archive contents
Compress-and-duplicate (dictionary-encode the archive contents, then duplicate the dictionary of repeated strings)
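The first two techniques can be sketched briefly. The code below is an illustration under assumptions of its own: the SECDED part uses an extended Hamming(8,4) code, which is only one of several possible SECDED codes, and the Merkle part uses SHA-256 and duplicates the last hash on odd-sized levels, which is one common convention among several. All function names are invented for the example.

```python
import hashlib


def secded_encode(nibble: int) -> int:
    """Encode 4 data bits as an 8-bit extended Hamming(8,4) codeword."""
    bits = [0] * 8  # bits[0] = overall parity, bits[1..7] = Hamming positions
    bits[3], bits[5], bits[6], bits[7] = [(nibble >> i) & 1 for i in range(4)]
    bits[1] = bits[3] ^ bits[5] ^ bits[7]  # covers positions 1, 3, 5, 7
    bits[2] = bits[3] ^ bits[6] ^ bits[7]  # covers positions 2, 3, 6, 7
    bits[4] = bits[5] ^ bits[6] ^ bits[7]  # covers positions 4, 5, 6, 7
    bits[0] = sum(bits[1:]) % 2            # parity over the other seven bits
    return sum(bit << i for i, bit in enumerate(bits))


def secded_decode(word: int) -> tuple[int, str]:
    """Correct one flipped bit; raise ValueError when two bits are flipped."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0
    for position in range(1, 8):
        if bits[position]:
            syndrome ^= position  # zero for a valid codeword
    if sum(bits) % 2 == 1:
        # Odd overall parity: one bit flipped, at index `syndrome`
        # (index 0 is the overall parity bit itself).
        bits[syndrome] ^= 1
        status = "corrected"
    elif syndrome:
        raise ValueError("two bits flipped: detected but not correctable")
    else:
        status = "ok"
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return nibble, status


def merkle_root(blocks: list[bytes]) -> bytes:
    """Hash every block, then hash pairs of hashes until one root remains.

    Expects at least one block.
    """
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]
```

The two complement each other: the SECDED code repairs isolated bit flips in place, while the Merkle root confirms that the archive as a whole is intact, and the intermediate hashes can narrow damage down to a specific block.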