shell, perl, performance, parallelism, profiling, etc. (was: Upgrade guidance)

Tue Oct 21 13:56:21 EDT 2008

On Tue, Oct 21, 2008 at 12:05 PM, Bill McGonigle <bill at bfccomputing.com> wrote:
> I think we need keywords/tags for 'man -k' to use.

  Well, there's "man -K", but it's sloooow.  I typically use Google as
a substitute, but that failed me in this case.  I suspect I'd still
fail though, even with a better "man", because the problem was I
wasn't using a matching keyword.  GIGO.

> Note, only valid in bash, not sh.

  Yah, and it doesn't matter if you use "#!/bin/bash" at the top if
you're running the script with "sh foo.sh".  It took me five minutes
to figure that out just now.  D'oh.
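
  For anyone else who trips on this, here is a rough sketch of a guard
that makes the failure obvious.  The body of foo.sh is made up for
illustration, and the SHELLOPTS check assumes that a bash started as
"sh" lists "posix" there; treat it as one possible approach, not the
canonical one.

```shell
# Hypothetical demo: foo.sh is the name used above; its body is invented here.
demo=$(mktemp -d)
cat > "$demo/foo.sh" <<'EOF'
#!/bin/bash
# Refuse to continue unless a real bash (not sh, not bash in POSIX mode) is reading this.
if [ -z "${BASH_VERSION:-}" ]; then
    echo "foo.sh: run me with bash, not sh" >&2
    exit 1
fi
case ":$SHELLOPTS:" in
    *:posix:*) echo "foo.sh: run me with bash, not sh" >&2; exit 1 ;;
esac
cat <( echo "process substitution works" )
EOF

# Run under the interpreter the "#!" line asks for.
bash "$demo/foo.sh" && bash_status=0 || bash_status=$?
# Here "#!" is only a comment: sh reads the file itself, so the guard trips.
sh "$demo/foo.sh" && sh_status=0 || sh_status=$?
echo "bash: $bash_status  sh: $sh_status"
rm -rf "$demo"
```

  The "#!" line only matters when the kernel executes the file
directly (./foo.sh); naming an interpreter on the command line
overrides it.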

> Efficiency is surprisingly worse with bash/comm, I don't get why:

  *Very* interesting.  I don't get anywhere near the same results.
I'm seeing the following as typical:

== Shell tools ==
real    0m4.650s
user    0m0.002s
sys     0m0.005s

== Perl ==
real    0m6.915s
user    0m6.135s
sys     0m0.642s

  I guess my home PC must be faster than your testbed in general
(seconds vs minutes), but the shell tools are faster for me, whereas
Perl is faster for you.  Also, I don't get how the "user" counter can
be so low for the shell tools, but so high for Perl.  Is it because
the shell tools fork, and children aren't counted?

> Almost all the work should be in the package-cleanup calls, not the set work.

  That depends on memory, I think.  If the entire RPM database can be
cached, performance may be quite different.  Maybe that's why I'm
seeing different numbers?

  My box is FC8, 2.8 GHz Pentium D dual core, 1 GB RAM.  /var/lib/rpm is 106 MB.

  Hmmm, maybe the dual core also means multiple processes can run
concurrently, while Perl is serializing everything?

  Hmmm.  A possible unforeseen (by me) advantage to shell tools:
Implicit parallelism.  Perhaps the recent trend towards multiple cores
will benefit shell scripts more than some newer stuff.
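
  One way to poke at both guesses (parallelism, and children not
being counted) is to make the concurrency explicit and wait for the
children before finishing.  This is only a sketch: intersect_parallel
is a name I made up, and it takes the two producer commands as
strings so it can be tried with anything, not just package-cleanup.
I have not confirmed that it explains the numbers above.

```shell
# intersect_parallel CMD_A CMD_B  (hypothetical helper)
# Runs both producer commands at the same time, sorts each result into a
# temp file, waits for both, then prints the lines common to the two.
intersect_parallel() {
    tmp=$(mktemp -d) || return 1
    sh -c "$1" | sort > "$tmp/a" &    # first producer, in the background
    sh -c "$2" | sort > "$tmp/b" &    # second producer, concurrently
    wait                              # reap both pipelines before going on
    comm -1 -2 "$tmp/a" "$tmp/b"
    rm -rf "$tmp"
}

# Possible use, under bash so that "time" covers the whole function:
#   time intersect_parallel 'package-cleanup --orphans' \
#                           'package-cleanup --leaves --all'
```

  Because the function waits for its background jobs, their CPU time
should be charged to whatever is timing it, which makes the "user"
figure easier to compare against the Perl run.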

> Unless comm is O(n^2), but I didn't look at the source.

  I would actually suspect sort(1).  There's no really good way to do
a sort; just minimally poor ways (O(n log n) for a comparison sort)
and lots of really poor ways.  And comm(1) only works on sorted
files; otherwise, we wouldn't need to sort.  Perl, on the other hand,
can use hashes (unsorted), where each lookup is typically constant
time, so the whole job is roughly linear.  If we had a tool like
comm(1) that used hashes, I suspect I'd see a further win.
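
  For what it's worth, awk can stand in for such a tool.  A minimal
sketch, assuming awk's associative arrays are hash-backed (they
usually are) and that we only want the equivalent of "comm -1 -2";
common_lines is a name invented for this example.

```shell
# common_lines A B  (hypothetical helper)
# Prints every line of B that also occurs in A.  No sorting is needed;
# output follows B's order, and duplicates in B are printed each time.
common_lines() {
    awk 'NR == FNR { seen[$0] = 1; next }   # first file: remember each line
         $0 in seen                         # second file: print if remembered
        ' "$1" "$2"
}

# Possible use under bash (process substitution):
#   common_lines <( package-cleanup --orphans ) \
#                <( package-cleanup --leaves --all )
```

  Unlike comm(1), this holds all of the first input in memory, which
is fine for a package list but would matter for very large files.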

  I also tweaked the shell code slightly:

comm -1 -2 \
        <( package-cleanup --orphans | sort ) \
        <( package-cleanup --leaves --all | sort ) \
        | grep -v 'Setting up yum'

  It should be more efficient to only grep once.  However, I tested
your variant as well.  On my box at least, eliminating the double grep
made less than 0.1 seconds real time difference.

  Hmmm, another idea:

comm -1 -2 \
        <( package-cleanup --orphans | tail -n +2 | sort ) \
        <( package-cleanup --leaves --all | tail -n +2 | sort )

  Why do a pattern match when you can just skip the first line, right?
Except that yields slightly *slower* typical performance for me, and
now the *user* counter is showing up!

real    0m4.970s
user    0m5.935s
sys     0m0.735s

  What the heck??

-- Ben