Trying to do something similar to a backup to avoid data corruption

Hey guys. I'm quite happy using rsync to back up files, but at present I'm trying to do something slightly different. I've developed a phobia of files corrupting and hope to solve this with a backup. I'd rather write my own script for this (I don't know how yet, but I will learn), so this is more to check that my reasoning is sound before I start trying to write it.

The idea is to have two identical directories on different partitions, maintaining exactly the same files in the same directory tree. In plain English, what I'm intending the script to do is the following (rough shell sketch after the list):

-------------------------------------------------------------------------------------------------------

Check the md5sum of a file against a previously generated list
If they match, move on to the next file.
If not, append the name of the file to a list of potentially corrupted files

Do the same in the second directory

Compare the two lists of potentially corrupt files.
If the same file name appears in both, but with different checksums, move the files to a folder termed "corrupt" within their own respective partitions.
If a file appears in one list but not the other, move the corrupt copy to a folder termed "replaced" and copy the good one across.

Print to a file how many of each exchange occurred each run, and the names of the files that were moved.


-------------------------------------------------------------------------------------------------------
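Something like the following is roughly what I have in mind, as a sketch only: the paths, the index filename and the "corrupt"/"replaced" folder names are all placeholders, error handling is minimal, and it assumes the md5sum index already exists in each directory.

```
#!/bin/bash
# Rough sketch only. DIR_A, DIR_B and INDEX are placeholders; the index is
# assumed to have been generated beforehand, e.g.
#   (cd "$DIR_A" && find . -type f ! -name checksums.md5 -exec md5sum {} + > checksums.md5)

DIR_A=/mnt/disk1/archive
DIR_B=/mnt/disk2/archive
INDEX=checksums.md5

# Verify one copy against its index; write the names of files that fail to a list.
check_dir () {
    local dir=$1 badlist=$2
    ( cd "$dir" && md5sum --quiet -c "$INDEX" 2>/dev/null ) \
        | awk -F': ' '/: FAILED/ {print $1}' > "$badlist"
}

check_dir "$DIR_A" /tmp/bad_a.txt
check_dir "$DIR_B" /tmp/bad_b.txt

# Fails in both copies: no good copy left, quarantine both.
comm -12 <(sort /tmp/bad_a.txt) <(sort /tmp/bad_b.txt) | while read -r f; do
    mkdir -p "$DIR_A/corrupt" "$DIR_B/corrupt"
    mv "$DIR_A/$f" "$DIR_A/corrupt/" && mv "$DIR_B/$f" "$DIR_B/corrupt/"
    echo "corrupt in both: $f"
done

# Fails only in A: set the bad copy aside and restore from B.
comm -23 <(sort /tmp/bad_a.txt) <(sort /tmp/bad_b.txt) | while read -r f; do
    mkdir -p "$DIR_A/replaced"
    mv "$DIR_A/$f" "$DIR_A/replaced/" && cp -p "$DIR_B/$f" "$DIR_A/$f"
    echo "replaced in A from B: $f"
done
# (plus the mirror-image loop for files that fail only in B)
```

The comm/sort business is just my way of splitting the two failure lists into "bad in both" and "bad on one side only".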

This relies upon initially generating a checksum file which then never itself becomes corrupt. Some form of self-checking mechanism for this would be wise; at the very least, check that the main md5sum files in the two directories match each other and abort if they do not.
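For that self-check, the simplest thing I can think of (same placeholder paths and index name as the sketch above) is just to hash the two index files themselves and refuse to run if they disagree:

```
# Abort if the two index files no longer match each other (placeholder paths as above).
sum_a=$(md5sum < "$DIR_A/$INDEX" | cut -d' ' -f1)
sum_b=$(md5sum < "$DIR_B/$INDEX" | cut -d' ' -f1)
if [ "$sum_a" != "$sum_b" ]; then
    echo "index files differ, refusing to continue" >&2
    exit 1
fi
```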

It will also require running a small script when moving new files into the directory, to the effect of calculating their checksums and appending them to the appropriate index file before/after copying them across.
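Roughly something like this, again with the made-up paths and index name from the sketch: checksum the file first, then copy it to both locations and append the same line to both index files.

```
# Hypothetical helper for adding a new file to both copies and both indexes.
add_file () {
    local src=$1 rel=$2       # source path, and relative name inside the archive
    local sum
    sum=$(md5sum < "$src" | cut -d' ' -f1)
    cp -p "$src" "$DIR_A/$rel"
    cp -p "$src" "$DIR_B/$rel"
    echo "$sum  $rel" >> "$DIR_A/$INDEX"   # md5sum's "hash  filename" format
    echo "$sum  $rel" >> "$DIR_B/$INDEX"
}
# e.g. add_file ~/Downloads/some.iso isos/some.iso
```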

Main concerns are
1/That the above is inherently unworkable
2/That it will lead to destroying data, especially that which is copied into the directory after the initial index file is generated
3/The processor and disk overhead will be excessive
4/That bash will do this poorly and I would be wiser to write it in C.

Any feedback welcome.
Cheers
 
You would be engaged in a constant battle to update the md5s of every little file that gets modified as you use your computer.
Even if you resolve that, you shouldn't move or copy automatically; you should print to a file or notify the user in some way and let them make the call, in case the file is still doing something important (despite what your checksum might suggest!).
 
Give Tripwire a look; it will monitor directories against a checksum database and scan at regular intervals.

But as Minto said, you should really do the actual copying yourself, or at least confirm it. I can't believe you have so much corruption occurring that it would be a hassle to confirm each occurrence or resolve it manually once detected?
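From memory the Open Source Tripwire workflow is roughly the following; treat the policy line as illustrative rather than exact, and the path is obviously just an example:

```
# Policy rule in the twpol.txt policy file (illustrative only):
#   /mnt/disk1/archive -> $(ReadOnly) ;

sudo twadmin --create-polfile /etc/tripwire/twpol.txt   # compile the plain-text policy
sudo tripwire --init                                    # build the baseline database
sudo tripwire --check                                   # later runs report anything that changed
```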
 
Doing this to / would be a terrible idea, and the same goes for almost anywhere else really.
This would be for storing music, ISO files etc. which are rarely changed but which I'd rather not lose. As such the files in question won't be doing anything, so moving them around isn't going to cause any problems. The problem I have with rsync here is that it looks like it will faithfully mirror corrupted files between directories, even with the -c flag. RAID 1 is flawed in the same way.
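The closest I've got with rsync is a checksum dry run, which at least lists the files that differ between the two copies without overwriting anything (placeholder paths again), but it can't tell me which side is the good one:

```
# Report content differences between the two copies without copying anything.
rsync -rnc --itemize-changes /mnt/disk1/archive/ /mnt/disk2/archive/
```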

I've come across Tripwire but never used it; will give some thought to applying it here. Cheers.

You may well be right in saying that it would be wiser/simpler to resolve the occurrences manually; I'll have to have a think about how wise automating this would be.

Cheers guys
 