Backing up a little - Trying to get LAPACK to work...

Tue May 25 22:07:24 EDT 2010

Jerry Feldman wrote:
> On 05/25/2010 09:03 PM, Bruce Labitt wrote:
>   
>> Umm, my CLAPACK experiment is not doing so well.  (See the "Shot in the 
>> Dark" thread.)  So I thought I'd try to interface to the "industry 
>> standard" LAPACK.  In the end, I expect to use CLAPACK, but I thought 
>> since MATLAB, GSL, SciPy, et al. use LAPACK, perhaps I could at least 
>> get some real work(TM) (coding) done.
>>
>> Fundamentally, the LAPACK results are the same as in CLAPACK, I suppose 
>> that is good in a way.  I rewrote everything in C using the accumulated 
>> knowledge I've gained.  Nearly everything is on the heap.  mallocs and 
>> frees where they belong.  When the 2x2 example is run, it works.  
>> Valgrind declared no leaks, no problems. 
>>
>> When the 9x9 example is run, it segfaults.  The program architecture is 
>> test1.c, svd.c.  test1 is "main".  svd.c is a wrapper function that 
>> actually calls the FORTRAN subroutine zgesvd_.  The segfault occurs when 
>> returning from svd.c, not returning from the FORTRAN subroutine.
>>
>> Valgrind reports that the routine does not know where to return to, 
>> i.e., the return address is 0x?  From what I've been told this is 
>> indicative of a stack error (overrun).
>>
>> If instead of using zgesvd_, I put in a dummy set of operations which 
>> actually write to all of the output matrices and then return from 
>> svd.c, the program runs with no "error" for the 9x9 case.  I did this 
>> experiment to see if I was doing something wrong.
>>
>> Next I tried compiling with the -fstack-protector-all switch.  If I 
>> removed the dummy operations (put back to "normal") and ran the 9x9 
>> case, zgesvd_ gave results (reported INFO=0) which indicated success.  
>> The svd.c routine returned to main (test1.c) and printed out an entirely 
>> optimistic success message ;).  However, on the next instruction, which 
>> accesses the output arrays, the system segfaulted with a similar 0x? 
>> error.  In other words the main program can no longer access the arrays 
>> which it had malloc'ed (and had not yet freed).
>>
>> If I am interpreting this correctly then it seems there is a stack error 
>> of some sort in my compiled version of LAPACK.  Or? <smart people fill 
>> in the blank here, please!>
>>
>> Does anyone have an idea?
>> One thing that I can try is to use the "reference" LAPACK in my system 
>> and link to it.  That way I can hopefully take out the effect of my build.
>> Any other suggestions?
>>
>> Jeesh, this was supposed to be 'just' a port...  :-[
>>     
>
> What you describe is indicative of stack corruption. If the pointers
> that used to have the malloc'd addresses are null then either you failed
> to check the results of malloc, or something wrote into your local
> stack. This can happen when something goes beyond the boundaries of
> arrays. This is another thing that Purify does for you.
>
>
>   
OK, I thought it seemed like stack corruption, too. 

I know I didn't check the results of malloc directly - however, I made 
it a point to write to the arrays to initialize them to a known value.  
If I had a bad malloc, wouldn't the program have died during that 
initialization?  It was one of those arrays whose pointer got hosed.
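
On a typical Linux system a write through a NULL pointer would normally
fault right away, so surviving the initialization loop does suggest the
allocations themselves succeeded. Checking explicitly is still cheap. A
minimal sketch, assuming a hypothetical helper alloc_zmatrix() and a
hypothetical zcomplex struct (two doubles, the usual layout for Fortran
COMPLEX*16); neither name comes from test1.c or svd.c:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical type: two doubles, matching the layout of Fortran COMPLEX*16. */
typedef struct { double re, im; } zcomplex;

/* Hypothetical helper: allocate n elements (n > 0), check the result,
   and set every element to a known value.  Exits on failure. */
static zcomplex *alloc_zmatrix(size_t n, const char *name)
{
    zcomplex *p = NULL;

    /* Refuse zero-size requests and sizes that would overflow size_t. */
    if (n > 0 && n <= SIZE_MAX / sizeof *p)
        p = malloc(n * sizeof *p);
    if (p == NULL) {
        fprintf(stderr, "allocation of %s (%zu elements) failed\n", name, n);
        exit(EXIT_FAILURE);
    }
    for (size_t i = 0; i < n; i++) {
        p[i].re = 0.0;
        p[i].im = 0.0;
    }
    return p;
}
```

For the 9x9 case this would be called as alloc_zmatrix(9 * 9, "a"), with a
matching free() once the results have been read.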

Purify & Insure++ are looking pretty good right now...
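
Short of those tools, one cheap check is to keep a copy of each pointer
outside the stack before the suspect call and compare afterwards. A rough
sketch, assuming hypothetical helpers save_ptr() and ptr_changed() (not part
of the original code):

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_SLOTS 8

/* File-scope storage: these copies do not live in any stack frame, so an
   overrun of the caller's locals should not touch them. */
static const void *saved_ptr[MAX_SLOTS];

/* Record the value of a pointer before the suspect call. */
static void save_ptr(size_t slot, const void *p)
{
    if (slot < MAX_SLOTS)
        saved_ptr[slot] = p;
}

/* After the call: return 1 (and report) if the pointer no longer matches
   the saved copy, 0 otherwise. */
static int ptr_changed(size_t slot, const void *now, const char *name)
{
    if (slot >= MAX_SLOTS)
        return 0;
    if (saved_ptr[slot] != now) {
        fprintf(stderr, "%s changed: was %p, now %p\n",
                name, (void *)saved_ptr[slot], (void *)now);
        return 1;
    }
    return 0;
}
```

Calling save_ptr(0, u) before the wrapper and ptr_changed(0, u, "u") right
after it returns (u being a hypothetical name for one of the output arrays)
would show whether the local variable holding the pointer was overwritten,
as opposed to the heap block it points to.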