Report on files by type

G Rundlett greg.rundlett at gmail.com
Wed Mar 10 23:11:51 EST 2010


Indeed, it does take a while to run "file" on tens of thousands of files.  I
came across http://www.pldaniels.com/filetype/ which may be of interest
in case somebody else is looking for a library.  For my immediate needs, the
one-liner or a script like the one Ben wrote is sufficient.

Thanks,

Greg Rundlett




On Wed, Mar 10, 2010 at 10:26 PM, Benjamin Scott <dragonhawk at gmail.com> wrote:

> On Wed, Mar 10, 2010 at 5:33 PM, Greg Rundlett (freephile)
> <greg at freephile.com> wrote:
> > find -type f | egrep -o '\.(.?.?..)$' | sort | uniq -c
>
>   Doesn't that regex miss files with single-character extensions, such
> as C source (.c) and header (.h) files?  I would suggest instead:
>
>        \.[^./]{1,4}$
>
>  That will match from one-to-four not-dot-or-slash characters at the
> end of a file name, preceded by a dot.  Excluding slash keeps grep
> from matching a short file name in a short dot directory (which I
> actually have some of, I was surprised to discover).  It will still
> return dot files with very short file names, such as ".face", but I
> don't think that's avoidable with just grep.  One would like to
> request one-or-more characters precede the extension, but grep can
> only output the entire match or the entire line, not just a
> back-reference (which is what we really want).
>
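
To make the difference concrete, here is a small Python sketch (an illustration only, not part of Ben's script) that compares the two patterns and then uses a capture group to get what grep -o can't give: the extension alone, with at least one character required before the dot. The names ORIGINAL, SUGGESTED, WITH_STEM, extension and count_extensions are made up for this example.

```python
import re
from collections import Counter

# The pattern from the original one-liner: needs 2-4 characters after the dot.
ORIGINAL = re.compile(r"\.(.?.?..)$")
# The suggested pattern: 1-4 characters that are neither dot nor slash.
SUGGESTED = re.compile(r"\.[^./]{1,4}$")
# Same idea, but require one non-slash character before the dot and
# capture only the extension, so dot files like ".face" are skipped.
WITH_STEM = re.compile(r"[^/]\.([^./]{1,4})$")


def extension(path):
    """Return the extension without its dot, or None if there is none."""
    match = WITH_STEM.search(path)
    return match.group(1) if match else None


def count_extensions(paths):
    """Tally extensions over an iterable of path strings."""
    return Counter(ext for ext in map(extension, paths) if ext is not None)


if __name__ == "__main__":
    samples = ["src/main.c", "./.x/ab", "./.face", "photo.jpeg"]
    for path in samples:
        print(path,
              bool(ORIGINAL.search(path)),
              bool(SUGGESTED.search(path)),
              extension(path))
```

Feeding count_extensions the lines produced by "find -type f" would give roughly the same tally as the sort | uniq -c pipeline, minus the dot-file false positives.
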
>  Here's a link to a Perl program I just whipped up that does a
> moderately better job (I think/hope).  It shouldn't be fooled by
> ".face".  It ignores case differences, considers backup files to be
> their "real" file extension, and reports on count *and* size.  It
> should be fairly easy to extend for additional statistics, if you've
> got any programming experience.
>
> http://pastebin.com/g1WicJ16
>
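
For anyone who would rather not chase the pastebin link, the behaviour described above (case-insensitive extensions, backup files counted under their "real" extension, count and size per extension) can be sketched in Python roughly as follows. This is not Ben's Perl program, only a guess at the same idea: the list of backup suffixes and the names real_extension and report are assumptions for this example.

```python
import os
import stat
from collections import defaultdict

# Assumption: which suffixes mark a backup file. Adjust to taste.
BACKUP_SUFFIXES = ("~", ".bak", ".orig")


def real_extension(name):
    """Lower-cased extension of a file name, ignoring one backup suffix."""
    for suffix in BACKUP_SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            name = name[: -len(suffix)]
            break
    stem, dot, ext = name.rpartition(".")
    # No dot, nothing before the dot (".face"), or nothing after it.
    if not dot or not stem or not ext:
        return "(none)"
    return ext.lower()


def report(root):
    """Map extension -> (file count, total bytes) for regular files under root."""
    stats = defaultdict(lambda: [0, 0])
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if not stat.S_ISREG(info.st_mode):
                continue  # mimic "find -type f"
            entry = stats[real_extension(name)]
            entry[0] += 1
            entry[1] += info.st_size
    return {ext: (count, size) for ext, (count, size) in stats.items()}


if __name__ == "__main__":
    for ext, (count, size) in sorted(report(".").items()):
        print(f"{count:8d} {size:14d} {ext}")
```
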
> > ... find out what was on the drive by 'content type' (mime-type) ...
>
>  At the risk of belaboring the obvious, what we're doing here is
> reporting on file name extensions, which isn't necessarily the same as
> reporting on content type.  File names don't have to match their
> content, files don't have to have an extension, and some content types
> have multiple extensions in common use (.jpg vs .jpeg, for example).
>
>  Reporting on actual file content type would require the moral
> equivalent of running file(1) on each file, which I expect would be
> quite slow.
>
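
As a toy illustration of why the name and the content can disagree, here is a Python sketch that looks at the first few bytes of a file instead of its name. It knows only a handful of well-known signatures and is nowhere near a replacement for file(1); the names SIGNATURES, sniff_bytes and sniff_file are made up for this example.

```python
# A few well-known leading byte signatures ("magic numbers").
# Deliberately tiny; file(1) knows thousands of these.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),  # also matches zip-based formats
]


def sniff_bytes(head):
    """Guess a MIME type from leading bytes, or return None if unknown."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None


def sniff_file(path, nbytes=16):
    """Read only the first nbytes of a file and guess its type."""
    with open(path, "rb") as handle:
        return sniff_bytes(handle.read(nbytes))
```

Even this cheap check has to open and read every file, which is where the time goes on a large tree.
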
> > http://diskuseanalyzer.sourceforge.net/
>
>  Not familiar with that one.  For Win32, check out WinDirStat
> (http://windirstat.info/), if you haven't already.  The "treemap"
> graph is astoundingly useful.
>
> > Is there something like "Disk Usage Analyzer" [1], which is great for a
> > quick summary, plus offers the ability to drill down etc. which also offers
> > other perspectives besides "size"? E.g.  report on mime-type, or group by
> > age?
>
>   Ah.  So, something GUI and interactive.
>
>  Well, there is "baobab", which does "ring charts" and has come with
> Ubuntu forever.
>
>  A Google for "WinDirStat Linux" seems quite promising:
>
> http://www.google.com/search?q=windirstat%20linux
>
>  Checking the Synaptic package manager on my Debian 5.0 box for "disk
> usage" finds some likely candidates.  Of the ones I just tried,
> "gdmap" seems the best, followed by "kdirstat".
>
>  I also just found "di", which is much like du(1), but with properly
> formatted output columns.
>
> -- Ben
>
> _______________________________________________
> gnhlug-discuss mailing list
> gnhlug-discuss at mail.gnhlug.org
> http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
>