shell, perl, performance, parallelism, profiling, etc. (was: Upgrade guidance)
    Ben Scott 
    dragonhawk at gmail.com
       
    Tue Oct 21 13:56:21 EDT 2008
    
    
  
On Tue, Oct 21, 2008 at 12:05 PM, Bill McGonigle <bill at bfccomputing.com> wrote:
> I think we need keywords/tags for 'man -k' to use.
  Well, there's "man -K", but it's sloooow.  I typically use Google as
a substitute, but that failed me in this case.  I suspect I'd still
fail though, even with a better "man", because the problem was I
wasn't using a matching keyword.  GIGO.
> Note, only valid in bash, not sh.
  Yah, and it doesn't matter if you use "#!/bin/bash" at the top if
you're running the script with "sh foo.sh".  It took me five minutes
to figure that out just now.  D'oh.
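  A minimal sketch of a guard for that trap, for scripts that rely on bash-only syntax such as <( ... ).  The function names shell_kind and require_full_bash are made up for illustration.  It assumes that when bash is started as "sh" it runs in POSIX mode and lists "posix" in SHELLOPTS; older bash versions disable process substitution in that mode, so the sketch treats it as "not really bash".

```shell
#!/bin/bash
# Report which kind of shell is actually interpreting this code,
# regardless of what the "#!" line says.
shell_kind() {
    if [ -z "${BASH_VERSION:-}" ]; then
        echo "other"                 # not bash at all (dash, ksh, ...)
    else
        case ":${SHELLOPTS:-}:" in
            *:posix:*) echo "bash-posix" ;;  # bash started as "sh" or with --posix
            *)         echo "bash" ;;
        esac
    fi
}

# Hypothetical guard: call this near the top of a script, before the
# first line that uses bash-only syntax.
require_full_bash() {
    [ "$(shell_kind)" = "bash" ] && return 0
    echo "error: run this with bash (e.g. 'bash foo.sh'), not sh" >&2
    return 1
}
```

  In a script this would be used as "require_full_bash || exit 1".  It has to come before the first bash-only construct, since the shell only gets to run the guard if it has not already hit a syntax error.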
> Efficiency is surprisingly worse with bash/comm, I don't get why:
  *Very* interesting.  I don't get anywhere near the same results.
I'm seeing the following as typical:
== Shell tools ==
real    0m4.650s
user    0m0.002s
sys     0m0.005s
== Perl ==
real    0m6.915s
user    0m6.135s
sys     0m0.642s
  I guess my home PC must be faster than your testbed in general
(seconds vs minutes), but shell tools are faster for me, while Perl is
faster for you.  Also, I don't get how the "user" counter can be so low
for the shell tools, yet so high for Perl.  Is it because the shell
tools fork, and children aren't counted?
> Almost all the work should be in the package-cleanup calls, not the set work.
  That depends on memory, I think.  If the entire RPM database can be
cached, performance may be quite different.  Maybe that's why I'm
seeing different numbers?
  My box is FC8, 2.8 GHz Pentium D dual core, 1 GB RAM.  /var/lib/rpm is 106 MB.
  Hmmm, maybe the dual core also means multiple processes can run
concurrently, while Perl is serializing everything?
  Hmmm.  A possible unforeseen (by me) advantage to shell tools:
Implicit parallelism.  Perhaps the recent trend towards multiple cores
will benefit shell scripts more than some newer stuff.
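  A small experiment that illustrates the idea, assuming bash.  The slow function is a stand-in for a slow command like package-cleanup, and the function names are invented for the demo.  Both process substitutions are started before cat reads anything, so the two sleeps overlap.  This only shows that the shell runs them concurrently; it does not prove that this is what explains the timings above.

```shell
#!/bin/bash
# Stand-in for a slow command: wait two seconds, then print one line.
slow() { sleep 2; echo "$1"; }

# Both <( ... ) children start right away and run side by side, so
# the elapsed time is roughly one sleep, not two.
timed_concurrent() {
    local start=$SECONDS
    cat <(slow a) <(slow b) > /dev/null
    echo $(( SECONDS - start ))
}

# The same two commands run one after the other, for comparison.
timed_serial() {
    local start=$SECONDS
    { slow a; slow b; } > /dev/null
    echo $(( SECONDS - start ))
}

echo "concurrent: $(timed_concurrent)s"
echo "serial:     $(timed_serial)s"
```

  SECONDS only counts whole seconds, so the numbers are coarse, but the concurrent run should come in at about half the serial one.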
> Unless comm is O(n^2), but I didn't look at the source.
  I would actually suspect sort(1).  There's no really good way to do
a sort; just minimally poor ways (and lots of really poor ways).  And
comm(1) only works on sorted files; otherwise, we wouldn't need to
sort.  Perl, on the other hand, can use hashes (unsorted), which are
typically much faster.  If we had a tool like comm(1) that used
hashes, I suspect I'd see a further win.
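  awk's associative arrays can stand in for such a tool.  A minimal sketch (the function name common_lines is invented here, and this has not been timed against the package-cleanup output):

```shell
#!/bin/sh
# common_lines FILE1 FILE2
# Print the lines of FILE2 that also appear in FILE1.  Neither input
# needs to be sorted; FILE1 is held in memory as an awk array.
common_lines() {
    awk '
        NR == FNR { seen[$0] = 1; next }   # first file: remember every line
        $0 in seen                         # second file: print remembered lines
    ' "$1" "$2"
}
```

  Under bash the arguments can be process substitutions, e.g. common_lines <( package-cleanup --orphans ) <( package-cleanup --leaves --all ), with the 'Setting up yum' line still to be filtered out afterwards.  Unlike comm(1), the output comes out in the order of the second input rather than sorted, and a line repeated in the second input is printed each time it occurs.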
  I also tweaked the shell code slightly:
comm -1 -2 \
        <( package-cleanup --orphans | sort ) \
        <( package-cleanup --leaves --all | sort ) \
        | grep -v 'Setting up yum'
  It should be more efficient to only grep once.  However, I tested
your variant as well.  On my box at least, eliminating the double grep
made less than 0.1 seconds real time difference.
  Hmmm, another idea:
comm -1 -2 \
        <( package-cleanup --orphans | tail -n +2 | sort ) \
        <( package-cleanup --leaves --all | tail -n +2 | sort )
  Why do a pattern match when you can just skip the first line, right?
Except that yields a slightly *slower* typical performance for me,
but now the *user* counter is showing up!
real    0m4.970s
user    0m5.935s
sys     0m0.735s
  What the heck??
-- Ben
    
    