shell, perl, performance, parallelism, profiling, etc. (was: Upgrade guidance)
Ben Scott
dragonhawk at gmail.com
Tue Oct 21 13:56:21 EDT 2008
On Tue, Oct 21, 2008 at 12:05 PM, Bill McGonigle <bill at bfccomputing.com> wrote:
> I think we need keywords/tags for 'man -k' to use.
Well, there's "man -K", but it's sloooow. I typically use Google as
a substitute, but that failed me in this case. I suspect I'd still
fail though, even with a better "man", because the problem was I
wasn't using a matching keyword. GIGO.
> Note, only valid in bash, not sh.
Yah, and it doesn't matter if you use "#!/bin/bash" at the top if
you're running the script with "sh foo.sh". It took me five minutes
to figure that out just now. D'oh.
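
A minimal sketch of a guard against that mistake, assuming a script that depends on bash-only syntax. The function name require_bash is made up for illustration. It relies on bash setting BASH_VERSION, which is still set when bash is invoked under the name "sh", so it only catches the case where sh is some other shell (dash, for example).

```shell
#!/bin/bash
# The "#!" line above is only consulted when the file is executed directly
# (./foo.sh). "sh foo.sh" names the interpreter explicitly, so the line is
# then just a comment.

# Hypothetical helper: succeed only if the running shell is bash.
require_bash() {
    if [ -z "${BASH_VERSION:-}" ]; then
        echo "this script needs bash: run ./foo.sh or 'bash foo.sh'" >&2
        return 1
    fi
}

# Typical use near the top of a script:
#   require_bash || exit 1
```

The function returns instead of exiting so the caller decides what to do; a script would normally follow it with "|| exit 1".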
> Efficiency is surprisingly worse with bash/comm, I don't get why:
*Very* interesting. I don't get anywhere near the same results.
I'm seeing the following as typical:
== Shell tools ==
real 0m4.650s
user 0m0.002s
sys 0m0.005s
== Perl ==
real 0m6.915s
user 0m6.135s
sys 0m0.642s
I guess my home PC must be faster than your testbed in general
(seconds vs minutes), but shell tools are faster for me, vs Perl for
you. Also, I don't get how the "user" counter can be so low for the
shell tools but so high for Perl. Is it because the shell tools
fork, and children aren't counted?
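
One way to poke at that question, as a sketch rather than an answer: a child's CPU time is only added to its parent's totals once the parent has waited for it, and the POSIX "times" builtin shows the shell's own and its waited-for children's accumulated times separately. The burn function and its loop count are made up for illustration.

```shell
# Hypothetical CPU burner: spins in the shell itself, no external commands.
burn() {
    i=0
    while [ "$i" -lt 200000 ]; do
        i=$((i + 1))
    done
}

# Run the burner in a subshell. The shell forks a child and waits for it,
# so the child's CPU time lands in the "children" totals, not the shell's own.
( burn )

# First output line: user and system time used by the shell itself.
# Second output line: user and system time of children it has waited for.
times
```

If a timed pipeline has children the shell never waits for, their CPU time would not show up in those totals, which may or may not be what is happening in the numbers above.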
> Almost all the work should be in the package-cleanup calls, not the set work.
That depends on memory, I think. If the entire RPM database can be
cached, performance may be quite different. Maybe that's why I'm
seeing different numbers?
My box is FC8, 2.8 GHz Pentium D dual core, 1 GB RAM. /var/lib/rpm is 106 MB.
Hmmm, maybe the dual core also means multiple processes can run
concurrently, while Perl is serializing everything?
Hmmm. A possible unforeseen (by me) advantage to shell tools:
Implicit parallelism. Perhaps the recent trend towards multiple cores
will benefit shell scripts more than some newer stuff.
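
A small sketch of that implicit parallelism, assuming bash and using sleep as a stand-in for a slow command like package-cleanup. The slow function is made up for illustration.

```shell
#!/bin/bash
# Process substitution is bash-only; this will not parse under plain sh.

# Stand-in for a slow command: wait two seconds, then print one line.
slow() {
    sleep 2
    echo "$1"
}

start=$(date +%s)
# Both <( ... ) children are started before cat runs, so they sleep
# concurrently; cat just reads the results one after the other.
output=$(cat <( slow first ) <( slow second ))
end=$(date +%s)
elapsed=$((end - start))

echo "$output"
echo "elapsed: ${elapsed}s (the two jobs run back to back would need at least 4s)"
```

Whether the two children actually land on separate cores is up to the kernel scheduler; the shell only makes them separate processes.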
> Unless comm is O(n^2), but I didn't look at the source.
I would actually suspect sort(1). There's no really good way to do
a sort; just minimally poor ways (O(n log n) at best for a comparison
sort) and lots of really poor ways. And comm(1) only works on sorted
files; otherwise, we wouldn't need to sort. Perl, on the other hand,
can use hashes (unsorted), where a lookup is typically constant time,
so they are typically much faster. If we had a tool like comm(1) that
used hashes, I suspect I'd see a further win.
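
For what it's worth, awk's associative arrays get close to such a tool. A minimal sketch, assuming two plain text inputs with one item per line; the function name common_lines is made up for illustration, and it has not been timed against the sort/comm version.

```shell
# Hypothetical helper: print the lines of file 2 that also occur in file 1.
# No sorting needed; output follows the order of file 2.
common_lines() {
    # While reading the first file (NR == FNR), remember each line as a key.
    # For the second file, print a line only if it was seen in the first.
    awk 'NR == FNR { seen[$0] = 1; next } $0 in seen' "$1" "$2"
}

# With bash, it could be fed the same way as comm, e.g.:
#   common_lines <( package-cleanup --orphans ) \
#                <( package-cleanup --leaves --all ) | grep -v 'Setting up yum'
```

Unlike comm(1), a line that is repeated in the second input is printed once per repetition.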
I also tweaked the shell code slightly:
    comm -1 -2 \
        <( package-cleanup --orphans | sort ) \
        <( package-cleanup --leaves --all | sort ) \
        | grep -v 'Setting up yum'
It should be more efficient to only grep once. However, I tested
your variant as well. On my box at least, eliminating the double grep
made less than 0.1 seconds of difference in real time.
Hmmm, another idea:
    comm -1 -2 \
        <( package-cleanup --orphans | tail -n +2 | sort ) \
        <( package-cleanup --leaves --all | tail -n +2 | sort )
Why do a pattern match when you can just skip the first line, right?
Except that yields a slightly *slower* typical performance for me,
but now the *user* counter is showing up!
real 0m4.970s
user 0m5.935s
sys 0m0.735s
What the heck??
-- Ben