file carving

Sun Oct 1 22:18:00 EDT 2006

Andy Bair <pabair at mitre.org> wrote:
>   On Fri, 2006-09-22 at 10:51 -0400, James R. Van Zandt wrote:
...
>   > "block stitching by analogy" - Use hashes between markers to identify
>   > sections of data that seem to match.  (You could use rsync-style
>   > rolling hashes, but that may be too expensive.)  I think this can be
>   > done in a single pass through the data - just add the hashes to a data
>   > structure (probably a hash table, indexed with some of the bits of the
>   > hash itself), with collisions signaling matches.  It may help to
>   > repeat, using different markers for the hashes.  (You want the
>   > interval between markers to be somewhat smaller than the block size.
>   > The two scans could be done in the same pass.)  Note that if a hashed
>   > section crosses a block boundary, and also matches a section somewhere
>   > else, than that block boundary is probably not a file boundary.
>   > Assume those sections came from different versions of the same file.
>   > Now check the block boundaries following the matching sections.
>   > Suppose in one case one finds the text "is the time for all good me"
>   > followed by a block with high-entropy data, and in another case one
>   > finds "is the time fo", again followed by random data.  Probably,
>   > there is a block somewhere starting "r all good me".  Construct an
>   > xmagic rule and search for it.  If you identify several matching
>   > sections, then you may have a selection of text continuations to
>   > choose from.  If several blocks end at exactly the same point, then
>   > you may have more than one way to stitch the blocks together.  Repeat,
>   > looking at the beginning of blocks.
>
>   Phew, you lost me, can you clarify?

Suppose you have something like the following, where blocks are 32
bytes long, and I've indicated block boundaries with commas.  I'm
showing (parts of) 13 blocks, numbered 1 through 13.  Think of blocks
filled with numerals as "high entropy":

...22now ,is the time for all good men to ,0000000000000000000000000000000,111111111111...
     ^^^^^^^^^^^  a hashed section
111111111,Lorem ipsum d now is the time fo,4444444444444444444444444444444,000000000000...
                        ^^^^^^^^^^ a matching hashed section
000000000,66666666666666666666666666666666,0000000000000000000000000000000,111111111111...
111111111,88888888888888888888888888888888,r all good men to come to the a,000000000000...
                                           ^^^^ probably the
                              continuation of the above block

(Please unwrap the lines if necessary.)

Suppose the hash tells you that the text "now is the" in the first and
second blocks matches a section of the 5th block.  However, the
difference in entropy tells you that block 6 probably should not
follow block 5.  So what's the right continuation?  According to block
2, the text after "is the time fo" may be "r all good men".  Block 12
starts that way, so that's one possibility to try.  Incidentally, you
can then continue the process - after block 2 we expect a block
starting "come to the a".

           - Jim Van Zandt