Report on files by type
G Rundlett
greg.rundlett at gmail.com
Wed Mar 10 23:11:51 EST 2010
Indeed, it does take a while to run "file" on tens of thousands of files. I
came across http://www.pldaniels.com/filetype/, which may be of interest
to anybody looking for a library. For my immediate needs, the
one-liner or a script like the one Ben wrote is sufficient.
Thanks,
Greg Rundlett
On Wed, Mar 10, 2010 at 10:26 PM, Benjamin Scott <dragonhawk at gmail.com> wrote:
> On Wed, Mar 10, 2010 at 5:33 PM, Greg Rundlett (freephile)
> <greg at freephile.com> wrote:
> > find -type f | egrep -o '\.(.?.?..)$' | sort | uniq -c
>
> Doesn't that regex miss files with single-character extensions, such
> as C source (.c) and header (.h) files? I would suggest instead:
>
> \.[^./]{1,4}$
>
> That will match from one-to-four not-dot-or-slash characters at the
> end of a file name, preceded by a dot. Excluding slash keeps grep
> from matching a short file name in a short dot directory (which I
> actually have some of, I was surprised to discover). It will still
> return dot files with very short file names, such as ".face", but I
> don't think that's avoidable with just grep. One would like to
> request one-or-more characters precede the extension, but grep can
> only output the entire match or the entire line, not just a
> back-reference (which is what we really want).
>
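For reference, a minimal sketch of the original pipeline with the suggested pattern dropped in. The function name count_extensions is made up for illustration, "grep -E" stands in for egrep, and the trailing "sort -rn" is an addition that lists the most common extensions first.

```shell
# Count files by extension (1 to 4 characters), most common first.
# The directory argument defaults to the current directory.
count_extensions() {
    find "${1:-.}" -type f |
        grep -Eo '\.[^./]{1,4}$' |   # dot, then 1-4 non-dot, non-slash chars at end of line
        sort | uniq -c | sort -rn
}
```

Called as "count_extensions /some/dir", it prints one line per extension with its count. Files without a matching extension are silently skipped, and very short dot files such as ".face" are still counted, as noted above.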
> Here's a link to a Perl program I just whipped up that does a
> moderately better job (I think/hope). It shouldn't be fooled by
> ".face". It ignores case differences, considers backup files to be
> their "real" file extension, and reports on count *and* size. It
> should be fairly easy to extend for additional statistics, if you've
> got any programming experience.
>
> http://pastebin.com/g1WicJ16
>
> > ... find out what was on the drive by 'content type' (mime-type) ...
>
> At the risk of belaboring the obvious, what we're doing here is
> reporting on file name extensions, which isn't necessarily the same as
> reporting on content type. File names don't have to match their
> content, files don't have to have an extension, and some content types
> have multiple extensions in common use (.jpg vs .jpeg, for example).
>
> Reporting on actual file content type would require the moral
> equivalent of running file(1) on each file, which I expect would be
> quite slow.
>
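A minimal sketch of that content-based report, assuming a version of file(1) that supports the --brief and --mime-type options. The function name mime_report is invented for illustration. Using "-exec ... {} +" hands many file names to each file(1) invocation, which avoids one process per file but does not make the content inspection itself any faster.

```shell
# Count files by detected MIME type, most common first.
# Relies on file(1) reading each file, so expect it to be slow on large trees.
mime_report() {
    find "${1:-.}" -type f -exec file --brief --mime-type {} + |
        sort | uniq -c | sort -rn
}
```

Called as "mime_report /some/dir", it prints one line per MIME type with its count.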
> > http://diskuseanalyzer.sourceforge.net/
>
> Not familiar with that one. For Win32, check out WinDirStat
> (http://windirstat.info/), if you haven't already. The "treemap"
> graph is astoundingly useful.
>
> > Is there something like "Disk Usage Analyzer" [1], which is great for a
> > quick summary, plus offers the ability to drill down etc. which also offers
> > other perspectives besides "size"? E.g. report on mime-type, or group by
> > age?
>
> Ah. So, something GUI and interactive.
>
> Well, there is "baobab", which does "ring charts" and has come with
> Ubuntu forever.
>
> A Google for "WinDirStat Linux" seems quite promising:
>
> http://www.google.com/search?q=windirstat%20linux
>
> Checking the Synaptic package manager on my Debian 5.0 box for "disk
> usage" finds some likely candidates. Of the ones I just tried,
> "gdmap" seems the best, followed by "kdirstat".
>
> I also just found "di", which is much like du(1), but with properly
> formatted output columns.
>
> -- Ben
>
> _______________________________________________
> gnhlug-discuss mailing list
> gnhlug-discuss at mail.gnhlug.org
> http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
>