Dataset Comparison

George Hincapie · 3 Sep 2013 at 10:19

I have two datasets. Both sets contain files in a variety of formats, but all have Windows-style file extensions.

There are files common to both sets, but there are also files unique to each set. I'd like to be able to identify which files are common and which are unique.

What's the best way to go about this?

I had thought that hashing all the files would be a start, and then I could compare the hash values. Or is there a smarter way of doing this?

Linkex · 3 Sep 2013 at 11:37

It really depends on the accuracy required. Hashtag is the best way to ensure a match is actually a match, but if file size and name are sufficient, then doing a DIR /s /N dump and importing it into Excel is a quick and dirty way to compare them.

George Hincapie · 3 Sep 2013 at 12:38

Is Hashtag an application? Have you got a link for it?

Linkex · 3 Sep 2013 at 12:40

Damn you, Twitter

-hashtag +hashing

George Hincapie · 3 Sep 2013 at 13:05

LOL

The file names aren't the same (although the file contents are) because they've been exported from an eDiscovery package. It's going to have to be the hashing method.
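
For anyone landing on this thread later, here is a rough sketch of the hashing approach discussed above. It is only one way to do it and makes several assumptions that nobody in the thread stated: each dataset sits under its own root folder, SHA-256 is an acceptable hash, and the function names (hash_file, hash_tree, compare_datasets) are made up for illustration. It matches on content only, so renamed exports still pair up, but it will only match files that are byte-for-byte identical.

```python
import hashlib
from pathlib import Path


def hash_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each content hash found under root to the relative paths with that content."""
    root = Path(root)
    by_hash = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = str(path.relative_to(root))
            by_hash.setdefault(hash_file(path), []).append(rel)
    return by_hash


def compare_datasets(root_a, root_b):
    """Split the two trees into common content, content only in A, and content only in B."""
    a = hash_tree(root_a)
    b = hash_tree(root_b)
    # Keys are hashes, so file names play no part in the comparison.
    common = {h: (a[h], b[h]) for h in a.keys() & b.keys()}
    only_a = {h: a[h] for h in a.keys() - b.keys()}
    only_b = {h: b[h] for h in b.keys() - a.keys()}
    return common, only_a, only_b
```

Calling compare_datasets with the two root folders returns three dictionaries keyed by hash. Each list of paths can hold more than one entry, which also shows up duplicates within a single dataset.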