file carving

James R. Van Zandt jrvz at comcast.net
Fri Sep 22 10:52:01 EDT 2006


Andy -

Thanks again for the interesting talk last night!

Some thoughts I have had since then:

"EOF patterns" - Search for allocation blocks that end with a string
of zeros, where the next block does not start with zeros.  Those are
probably file boundaries (where the unused space after the file data
was left as zeros).  Next, calculate entropy and other measures on
data segments one quarter the length of allocation blocks.  Look for
blocks where the last segment is much different from the first segment
and also from the first segment of the following block.  These too are
probably file boundaries (where the unused space has data left over
from a previous file).
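
Here is a rough Python sketch of both tests (the block size, the
zero-run length, and the entropy gap are all guesses to be tuned):

    import math

    BLOCK = 4096        # assumed allocation block size
    ZERO_TAIL = 16      # assumed minimum run of trailing zeros

    def entropy(data):
        """Shannon entropy of a byte string, in bits per byte."""
        n = len(data)
        if n == 0:
            return 0.0
        counts = [0] * 256
        for b in data:
            counts[b] += 1
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def eof_candidates(image):
        """Yield offsets of blocks that look like the end of a file."""
        for off in range(0, len(image) - BLOCK, BLOCK):
            block = image[off:off + BLOCK]
            nxt = image[off + BLOCK:off + 2 * BLOCK]
            # zero-tail test: trailing zeros, next block not zero-padded
            if block.endswith(b"\0" * ZERO_TAIL) and not nxt.startswith(b"\0"):
                yield off
                continue
            # entropy test: last quarter differs sharply from this
            # block's first quarter and the next block's first quarter
            q = BLOCK // 4
            last = entropy(block[-q:])
            if (abs(last - entropy(block[:q])) > 2.0 and
                    abs(last - entropy(nxt[:q])) > 2.0):
                yield off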

"block data typing" - Even a high-entropy data file may have enough
internal structure to let you identify the data type of a block, even
if you don't have the start of the file.  E.g. for gzip files:
http://www.gzip.org/recover.txt
http://www.urbanophile.com/arenn/coding/gzrt/gzrt.html
This may require files compressed with "gzip --rsyncable".
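
As a toy illustration, here is a statistical block typer in Python
(the thresholds are invented):

    import math
    from collections import Counter

    def entropy(data):
        """Shannon entropy in bits per byte."""
        n = len(data)
        return -sum(c / n * math.log2(c / n)
                    for c in Counter(data).values())

    def classify_block(block):
        """Crude type guess for a block that lacks a file header."""
        h = entropy(block)
        printable = sum(32 <= b < 127 or b in (9, 10, 13) for b in block)
        if printable / len(block) > 0.95:
            return "text"
        if h > 7.9:
            return "compressed or encrypted"
        return "structured binary"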

"dynamic xmagic" - Suppose you know enough about the internal
structure of a file to expect a certain pattern at a particular offset
into the file (say, the central directory of a zip file), but when
you look at that offset in your data, the pattern is missing.
Evidently you're
looking at the wrong block.  That means there is some other block with
that pattern.  Figure out what the offset into the block must be,
construct an xmagic rule for it, and search.
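
A sketch in Python, assuming files start on block boundaries, so a
pattern expected at file offset F must sit at F mod BLOCK within some
block:

    BLOCK = 4096    # assumed allocation block size

    def blocks_matching(image, pattern, file_offset):
        """Return offsets of all blocks holding 'pattern' at the
        in-block position implied by its offset into the file."""
        inblock = file_offset % BLOCK
        hits = []
        for off in range(0, len(image) - BLOCK + 1, BLOCK):
            if image[off + inblock:off + inblock + len(pattern)] == pattern:
                hits.append(off)
        return hits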

"block stitching by analogy" - Use hashes between markers to identify
sections of data that seem to match.  (You could use rsync-style
rolling hashes, but that may be too expensive.)  I think this can be
done in a single pass through the data - just add the hashes to a data
structure (probably a hash table, indexed with some of the bits of the
hash itself), with collisions signaling matches.  It may help to
repeat, using different markers for the hashes.  (You want the
interval between markers to be somewhat smaller than the block size.
The two scans could be done in the same pass.)  Note that if a hashed
section crosses a block boundary, and also matches a section somewhere
else, then that block boundary is probably not a file boundary.
Assume those sections came from different versions of the same file.
Now check the block boundaries following the matching sections.
Suppose in one case one finds the text "is the time for all good me"
followed by a block with high-entropy data, and in another case one
finds "is the time fo", again followed by random data.  Probably,
there is a block somewhere starting "r all good me".  Construct an
xmagic rule and search for it.  If you identify several matching
sections, then you may have a selection of text continuations to
choose from.  If several blocks end at exactly the same point, then
you may have more than one way to stitch the blocks together.  Repeat,
looking at the beginning of blocks.
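
A single-pass sketch in Python - newline as the marker and Python's
built-in hash() are stand-ins for whatever a real tool would use:

    from collections import defaultdict

    MARKER = b"\n"      # assumed marker; repeat with a different one

    def matching_sections(image):
        """Hash the data between successive markers in one pass; a
        hash recorded at more than one offset signals sections that
        seem to match."""
        table = defaultdict(list)           # hash -> [section offsets]
        start = image.find(MARKER)
        while start != -1:
            end = image.find(MARKER, start + 1)
            if end == -1:
                break
            table[hash(image[start:end])].append(start)
            start = end
        return {h: offs for h, offs in table.items() if len(offs) > 1}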

"block stitching by incremental validation" - Suppose you have
identified the beginning of a gzip file, but not its end.  Append
another high-entropy data block and attempt to decompress the result.
If gunzip consumes the entire added block and fails only because it
runs out of input, then that block is a valid continuation.
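
A sketch using Python's zlib module (wbits=47 auto-detects a gzip or
zlib header; note that a merely truncated stream raises no error -
only corrupt data does - which is exactly the test we want):

    import zlib

    def extends_stream(prefix, candidate):
        """True if 'candidate' plausibly continues the compressed
        stream that begins in 'prefix': decompression must get through
        the whole candidate block without an error inside it."""
        d = zlib.decompressobj(wbits=47)    # 32+15: gzip/zlib header
        try:
            d.decompress(prefix)
            d.decompress(candidate)
        except zlib.error:
            return False    # corruption inside the appended block
        return True         # if d.eof is set, the file also ends here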

"incremental carving" - Your tool should let you essentially build a
filesystem incrementally.  I.e. it should make it convenient to:
 - specify a file structure (which blocks in which order form the
file) and attributes (type, name, age, etc.)  If your data actually
has a filesystem, then you should be able to populate your file
structure accordingly.  (Then you can work on organizing the data in
the free blocks.)
 - specify other information about the data that you have not yet
organized into files (block entropy, SOF patterns, EOF patterns,
apparent file boundaries)
 - write the information to a file, read it back from a file
 - import the results of another file carving tool
 - run an xmagic search or other test over particular classes of data
blocks (e.g. high entropy blocks, or blocks not yet allocated to
files).  
 - automatically construct a test file from blocks according to a
rule, and attempt to validate it.

If you save the filesystem as a text file, you can use svn or cvs to
collaborate.
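
A purely hypothetical sketch of what such a text description might
look like (every field name here is invented):

    # carve.txt - one reconstructed file per stanza
    file    report.doc
    type    application/msword
    blocks  1042 1043 1051 1050
    status  validated

    file    photo.jpg
    type    image/jpeg
    blocks  2200-2207 2310
    status  candidate    # EOF pattern at 2310, not yet validated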


I assume you know about scalpel:
    http://www.digitalforensicssolutions.com/Scalpel/


I hope this is of some use :-)

           - Jim Van Zandt


