shell, perl, performance, parallelism, profiling, etc. (was: Upgrade guidance)

Tue Oct 21 16:24:36 EDT 2008

On Oct 21, 2008, at 13:56, Ben Scott wrote:

> On Tue, Oct 21, 2008 at 12:05 PM, Bill McGonigle  
> <bill at bfccomputing.com> wrote:
>> I think we need keywords/tags for 'man -k' to use.
>
>   Well, there's "man -K", but it's sloooow.  I typically use Google as
> a substitute, but that failed me in this case.  I suspect I'd still
> fail though, even with a better "man", because the problem was I
> wasn't using a matching keyword.  GIGO.

Yeah, but tagging is nice in that you don't need a direct word match
in the page, and tags are easy to index.  Would I be the first to
think of comm as performing set operations?  I'm also a bit fuzzy on
how long man(1) should survive as a set of roff documents; it has no
idea what an Internet is.
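
To make the "set operations" reading of comm(1) concrete, here is a
minimal sketch.  The function names and the sample fruit lists are made
up for illustration; the -1/-2/-3 column-suppression flags are standard
comm.  Both inputs must already be sorted.

```shell
#!/bin/bash
# Hypothetical helpers: treat two sorted files as sets of lines.
set_intersection() { comm -12 "$1" "$2"; }   # lines present in both files
set_difference()   { comm -23 "$1" "$2"; }   # lines only in the first file

# Made-up sample data, already in sorted order.
a=$(mktemp); b=$(mktemp)
printf '%s\n' apple banana cherry > "$a"
printf '%s\n' banana cherry date  > "$b"

set_intersection "$a" "$b"    # banana, cherry
set_difference   "$a" "$b"    # apple
rm -f "$a" "$b"
```

None of "set", "intersection" or "difference" is a word you would
necessarily search for when looking for comm, which is the argument for
tags.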

>>  Is it because the shell tools
>> fork, and children aren't counted?

If that were so my shell numbers should be lower.
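
One way to check whether forked children get counted at all is bash's
`times` builtin, which reports the shell's own CPU time and that of its
waited-for children on separate lines.  A rough sketch, with a made-up
helper name and an arbitrary awk busy loop standing in for the real
tools:

```shell
#!/bin/bash
# Hypothetical helper: burn CPU in a child, then print the shell's accounting.
show_child_times() {
    local report
    report=$(mktemp)
    # The loop runs in a forked awk process, not in the shell itself.
    awk 'BEGIN { for (i = 0; i < 2000000; i++) x += i }'
    # Redirect to a file instead of piping: a pipe or $(...) would run
    # `times` in a subshell with its own, empty child accounting.
    times > "$report"
    # Line 1: user/sys time of the shell itself.
    # Line 2: user/sys time of children the shell has waited for.
    cat "$report"
    rm -f "$report"
}

show_child_times
```

If the second line is non-zero, the children's CPU time is being
accumulated, so forking by itself shouldn't hide it.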

>>   My box is FC8, 2.8 GHz Pentium D dual core, 1 GB RAM.
>> /var/lib/rpm is 106 MB.

$ du -sh /var/lib/rpm
160M    /var/lib/rpm

$ rpm -qa | wc -l
1715

$ uname -a
Linux dhd.bfc 2.6.26.3-29.fc9.i686 #1 SMP Wed Sep 3 03:42:27 EDT 2008 i686 i686 i386 GNU/Linux

$ free
             total       used       free     shared    buffers     cached
Mem:       1945152    1854348      90804          0     604852     855912
-/+ buffers/cache:     393584    1551568
Swap:      2096376       1840    2094536

model name      : Intel(R) Pentium(R) 4 CPU 2.50GHz
stepping        : 9
cpu MHz         : 2500.226
cache size      : 512 KB

>   Hmmm, maybe the dual core also means multiple processes can run
> concurrently, while Perl is serializing everything?

Ah, or perhaps time(1) doesn't hop cores?  The numbers in your Perl
run add up better than those in your shell run.  Perl is poor at SMP
(gah! Perl threads!).

>   I would actually suspect sort(1).  There's no really good way to do
> a sort; just minimally poor ways (and lots of really poor ways).  And
> comm(1) only works on sorted files; otherwise, we wouldn't need to
> sort.  Perl, on the other hand, can use hashes (unsorted), which are
> typically much faster.  If we had a tool like comm(1) that used
> hashes, I suspect I'd see a further win.

Good point.  You should be able to hack my Perl script to do that in
about 10 minutes. :)
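
For comparison, here is a rough sketch of a hash-based "comm -1 -2"
that needs no sort at all.  This is awk rather than my Perl script, the
function name is made up, and I haven't timed it against either
version.

```shell
#!/bin/bash
# Hypothetical helper: print lines common to both inputs, no sorting needed.
# awk arrays are hash tables, so each membership test is a lookup.
hash_intersection() {
    awk '
        # First input: remember every line as a hash key.
        NR == FNR { seen[$0] = 1; next }
        # Second input: print each common line once, in second-input order.
        ($0 in seen) && !printed[$0]++
    ' "$1" "$2"
}
# Caveat: if the first input is empty, NR == FNR also holds for the second
# input, so nothing is printed (which happens to be the right answer).
```

It takes process substitutions the same way comm does, so the pipeline
from earlier in the thread would become:

    hash_intersection \
            <( package-cleanup --orphans | tail -n +2 ) \
            <( package-cleanup --leaves --all | tail -n +2 )

The output follows the order of the second input rather than sorted
order.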

> comm -1 -2 \
>         <( package-cleanup --orphans | tail -n +2 | sort ) \
>         <( package-cleanup --leaves --all | tail -n +2 | sort )
>
>   Why do a pattern match when you can just skip the first line, right?
>  Except that yields a slightly *slower* typical performance for me,
> but now the *user* counter is showing up!
>
> real    0m4.970s
> user    0m5.935s
> sys     0m0.735s
>
>   What the heck??

Hrm, could there be something about the 'tail' pipe that causes CPU
affinity?  I have no idea how SMP scheduling really works in Linux.
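
One thing the quoted numbers do say: user + sys is larger than real,
which can only happen if more than one core was busy at some point.  A
quick sketch of the arithmetic (the function name is made up; the
figures are the ones quoted above):

```shell
#!/bin/bash
# Hypothetical helper: compare total CPU time with wall-clock time.
# Arguments, in seconds: real user sys
cpu_vs_wall() {
    LC_ALL=C awk -v real="$1" -v user="$2" -v sys="$3" 'BEGIN {
        cpu = user + sys
        # A ratio above 1.0 means CPU time was accumulated in parallel.
        printf "cpu=%.3f real=%.3f ratio=%.2f\n", cpu, real, cpu / real
    }'
}

cpu_vs_wall 4.970 5.935 0.735
# prints: cpu=6.670 real=4.970 ratio=1.34
```

So on average about 1.3 cores were in use over the run, whatever the
scheduler was doing with the individual processes.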

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill at bfccomputing.com        Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf