TIP: yum hangs on futex()

Bill McGonigle bill at bfccomputing.com
Mon Jan 15 18:25:01 EST 2007


Ever since a server locked up a couple of weeks ago, yum would just hang
trying to do anything.  I decided to tackle it today.

Doing an strace revealed the hang was after opening the files in
/var/lib/rpm and yum was waiting on a futex() call.

To make a long story short, it turns out RPM uses a BerkeleyDB and,  
as I seem to find in some unique situation every week, BerkeleyDB  
doesn't survive a hard reboot if it's in use when such a thing  
happens (usually power outages around here...).

The tell-tale sign, in general, is a trail of __db.nnn files, e.g.,
in /var/lib/rpm:

$ ls -l
total 144528
-rw-r--r-- 1 rpm  rpm   20131840 Jan 15 14:45 Basenames
-rw-r--r-- 1 rpm  rpm      12288 Jan 15 13:52 Conflictname
-rw-r--r-- 1 rpm  rpm    9764864 Jan 15 14:45 Dirnames
-rw-r--r-- 1 rpm  rpm   20750336 Jan 15 14:45 Filemd5s
-rw-r--r-- 1 rpm  rpm      45056 Jan 15 14:45 Group
-rw-r--r-- 1 rpm  rpm      36864 Jan 15 14:45 Installtid
-rw-r--r-- 1 rpm  rpm      86016 Jan 15 14:45 Name
-rw-r--r-- 1 rpm  rpm  104443904 Jan 15 14:45 Packages
-rw-r--r-- 1 rpm  rpm     663552 Jan 15 14:45 Providename
-rw-r--r-- 1 rpm  rpm     249856 Jan 15 14:45 Provideversion
-rw-r--r-- 1 rpm  rpm      12288 Jan 12 14:20 Pubkeys
-rw-r--r-- 1 rpm  rpm     897024 Jan 15 14:45 Requirename
-rw-r--r-- 1 rpm  rpm     442368 Jan 15 14:45 Requireversion
-rw-r--r-- 1 rpm  rpm     315392 Jan 15 14:45 Sha1header
-rw-r--r-- 1 rpm  rpm     167936 Jan 15 14:45 Sigmd5
-rw-r--r-- 1 rpm  rpm      12288 Jan 15 13:53 Triggername
-rw-r--r-- 1 root root         0 Jan 13 05:06 __db.000
-rw-r--r-- 1 root root     24576 Jan 13 05:04 __db.001
-rw-r--r-- 1 root root   1318912 Jan 13 05:04 __db.002
-rw-r--r-- 1 root root    450560 Jan 13 05:04 __db.003
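
A quick way to check a directory for those leftovers is sketched below. The function name list_db_regions is made up for illustration, and the files being there is only a hint, not proof of a wedged database.

```shell
#!/bin/sh
# List BerkeleyDB region files (__db.NNN) found in a directory.
# Prints one path per line; returns 0 if any were found, 1 otherwise.
# list_db_regions is a hypothetical helper name, not part of rpm or BerkeleyDB.
list_db_regions() {
    dir=${1:-/var/lib/rpm}
    found=1
    for f in "$dir"/__db.[0-9]*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        printf '%s\n' "$f"
        found=0
    done
    return "$found"
}
```

Called with no argument it looks in /var/lib/rpm; pass another directory for other BerkeleyDB users.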

To fix this, or any general case of BerkeleyDB corruption:

   kill any processes that look like they're waiting on the rpm database
   cd /var/lib/rpm
   db_recover

And, lo, yum works again.
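
The last two steps above could be wrapped up roughly like this. It is a minimal sketch, not a tested procedure: it adds a backup copy first (my addition, not in the steps above), leaves killing the stuck processes to you, and the names recover_db_dir and DB_RECOVER are hypothetical. DB_RECOVER only exists to let you point at a differently named binary.

```shell
#!/bin/sh
# Back up a BerkeleyDB environment directory, then run recovery on it.
# recover_db_dir and the DB_RECOVER override are hypothetical names used
# for illustration; the tool defaults to db_recover.
recover_db_dir() {
    dir=${1:-/var/lib/rpm}
    backup="${dir%/}.bak.$$"

    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }

    # Keep a copy first, in case recovery makes things worse.
    cp -Rp "$dir" "$backup" || return 1
    echo "backup saved to $backup"

    # -h names the environment's home directory, -v asks for verbose output.
    "${DB_RECOVER:-db_recover}" -v -h "$dir"
}
```

Run it as root only after nothing is holding the rpm database open; the backup directory can be removed once yum behaves again.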

Other things that have similarly crapped out on various servers in  
the past few weeks for me include postgrey and openldap.  The  
db_recover trick seems to work most of the time.  db_verify rarely  
reports anything useful but will reliably segfault on a broken  
database. (What's the emoticon for 'bangs-head-on-wall'?)
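
Since a crash from db_verify is itself a signal, something like the following sketch can at least tell "crashed" apart from "reported an error". It assumes a shell such as bash or dash, where a command killed by a signal shows up as exit status 128 plus the signal number (SIGSEGV is 11 on Linux). The names check_db_files and DB_VERIFY are hypothetical.

```shell
#!/bin/sh
# Run db_verify on each database file in a directory and classify the result.
# check_db_files and the DB_VERIFY override are hypothetical names used for
# illustration; the tool defaults to db_verify.
check_db_files() {
    dir=${1:-/var/lib/rpm}
    bad=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        case "${f##*/}" in __db.*) continue ;; esac   # skip region files
        "${DB_VERIFY:-db_verify}" "$f" >/dev/null 2>&1
        status=$?
        if [ "$status" -eq 0 ]; then
            echo "ok       $f"
        elif [ "$status" -gt 128 ]; then
            # Assumes the 128+N convention for death by signal N.
            echo "crashed  $f (signal $((status - 128)))"
            bad=1
        else
            echo "failed   $f (exit $status)"
            bad=1
        fi
    done
    return "$bad"
}
```

A "crashed ... (signal 11)" line is the segfault case; the function returns non-zero if any file failed or crashed.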

Good news: it looks like versions of RPM in CVS (please be in Fedora  
7...) will use SQLite for a back-end if available.

Plug: Come hear John Harris learn us about SQLite at DLSLUG on Feb. 1st.

-Bill

-----
Bill McGonigle, Owner           Work: 603.448.4440
BFC Computing, LLC              Home: 603.448.1668
bill at bfccomputing.com           Cell: 603.252.2606
http://www.bfccomputing.com/    Page: 603.442.1833
Blog: http://blog.bfccomputing.com/
VCard: http://bfccomputing.com/vcard/bill.vcf


