Report on files by type

Benjamin Scott dragonhawk at gmail.com
Wed Mar 10 22:26:34 EST 2010


On Wed, Mar 10, 2010 at 5:33 PM, Greg Rundlett (freephile)
<greg at freephile.com> wrote:
> find -type f | egrep -o '\.(.?.?..)$' | sort | uniq -c

  Doesn't that regex miss files with single-character extensions, such
as C source (.c) and header (.h) files?  I would suggest instead:

	\.[^./]{1,4}$

  That will match from one-to-four not-dot-or-slash characters at the
end of a file name, preceded by a dot.  Excluding slash keeps grep
from matching a short file name in a short dot directory (which I
actually have some of, I was surprised to discover).  It will still
return dot files with very short file names, such as ".face", but I
don't think that's avoidable with just grep.  One would like to
request one-or-more characters precede the extension, but grep can
only output the entire match or the entire line, not just a
back-reference (which is what we really want).
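
  To make the difference between the two patterns concrete, here is a
small Python sketch that mimics what grep -o would print for each one.
The sample paths are hypothetical, picked only to exercise the cases
described above, and Python's re module stands in for egrep.

```python
import re

# The pattern suggested above: a dot, then one to four characters that are
# neither a dot nor a slash, anchored at the end of the path.
SUGGESTED = re.compile(r"\.[^./]{1,4}$")
# The pattern from the quoted pipeline, for comparison.
QUOTED = re.compile(r"\.(.?.?..)$")


def match_text(pattern, path):
    """Return the text the pattern matches in path (like grep -o), or None."""
    found = pattern.search(path)
    return found.group(0) if found else None


# Hypothetical sample paths: a one-character extension, a four-character
# extension, a short file in a short dot directory, a dot file, no extension.
SAMPLES = ["./src/main.c", "./photo.jpeg", "./.x/rc", "./.face", "./README"]

for path in SAMPLES:
    print(path, match_text(QUOTED, path), match_text(SUGGESTED, path))
```

  The quoted pattern finds nothing in "./src/main.c" and matches across
the slash in "./.x/rc"; the suggested one handles both, but still
reports ".face" as an extension.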

  Here's a link to a Perl program I just whipped up that does a
moderately better job (I think/hope).  It shouldn't be fooled by
".face".  It ignores case differences, considers backup files to be
their "real" file extension, and reports on count *and* size.  It
should be fairly easy to extend for additional statistics, if you've
got any programming experience.

http://pastebin.com/g1WicJ16
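
  For readers who can't reach the paste, the following is a rough Python
sketch of the same idea, not a port of the Perl program.  It folds
case, requires at least one character before the dot (so ".face" counts
as having no extension), and tallies both count and size.  The function
names are made up for illustration, and treating a trailing "~" or
".bak" as the backup marker is an assumption; the Perl program may use
different rules.

```python
import os
import re
from collections import defaultdict

# Assumption: a "backup file" ends in "~" or ".bak".
BACKUP_SUFFIX = re.compile(r"(~|\.bak)$", re.IGNORECASE)
# At least one character that is not a dot or slash must precede the dot,
# so a dot file such as ".face" is not treated as an extension.
EXTENSION = re.compile(r"[^./]\.([^./]{1,4})$")


def extension_of(name):
    """Return the lower-cased extension of a bare file name, or "(none)"."""
    name = BACKUP_SUFFIX.sub("", name)
    found = EXTENSION.search(name)
    return found.group(1).lower() if found else "(none)"


def summarize(root):
    """Map each extension under root to [file count, total bytes]."""
    stats = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Regular files only, roughly what find -type f selects.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            entry = stats[extension_of(name)]
            entry[0] += 1
            entry[1] += os.path.getsize(path)
    return dict(stats)
```

  Calling summarize(".") and printing the items sorted by count gives a
report comparable to the sort | uniq -c pipeline, with sizes added.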

> ... find out what was on the drive by 'content type' (mime-type) ...

  At the risk of belaboring the obvious, what we're doing here is
reporting on file name extensions, which isn't necessarily the same as
reporting on content type.  File names don't have to match their
content, files don't have to have an extension, and some content types
have multiple extensions in common use (.jpg vs .jpeg, for example).

  Reporting on actual file content type would require the moral
equivalent of running file(1) on each file, which I expect would be
quite slow.
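
  As a sketch of what that could look like, the Python below hands a
list of paths to a single file(1) invocation and counts the MIME types
it prints.  It assumes a file(1) that supports --brief and --mime-type
(the common Unix implementation does); the function name is made up for
illustration.  Passing many paths per invocation avoids one process per
file, though a very long list would have to be split into chunks to
stay under the system's argument-length limit.

```python
import subprocess
from collections import Counter


def mime_type_counts(paths):
    """Count the MIME types that file(1) reports for the given paths."""
    paths = list(paths)
    if not paths:
        return Counter()
    # --brief omits the file name, --mime-type prints e.g. "text/plain";
    # file(1) writes one line of output per path.
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "--"] + paths,
        check=True,
        capture_output=True,
        text=True,
    )
    return Counter(result.stdout.splitlines())
```

  This still opens and reads the start of every file, so it will be
noticeably slower than looking at names alone.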

> http://diskuseanalyzer.sourceforge.net/

  Not familiar with that one.  For Win32, check out WinDirStat
(http://windirstat.info/), if you haven't already.  The "treemap"
graph is astoundingly useful.

> Is there something like "Disk Usage Analyzer" [1], which is great for a
> quick summary, plus offers the ability to drill down etc. which also offers
> other perspectives besides "size"? E.g.  report on mime-type, or group by
> age?

  Ah.  So, something GUI and interactive.

  Well, there is "baobab", which does "ring charts" and has come with
Ubuntu forever.

  A Google for "WinDirStat Linux" seems quite promising:

http://www.google.com/search?q=windirstat%20linux

  Checking the Synaptic package manager on my Debian 5.0 box for "disk
usage" finds some likely candidates.  Of the ones I just tried,
"gdmap" seems the best, followed by "kdirstat".

  I also just found "di", which is much like df(1), but with properly
formatted output columns.

-- Ben


