FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

Sun Oct 11 16:46:02 EDT 2009

Jim Kuzdrall wrote:
> Greetings Bruce,
>
>     Interesting and challenging project!
>
> On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
>   
>> For anyone that is remotely interested, here is the big picture for
>> the problem I'm trying to solve.  If you are not interested, hey
>> delete the post.  Won't irritate me in the least!
>>
>>     
>     If you just transferred the data (no framing or error checking), how 
> many bits per second must you transfer to keep up with the FFT data 
> production?
>   
For the problems I'm doing now, the net cannot keep up.  A ~10M point 
complex double precision FFT is about 160 MB (16 bytes per point).  At 
800 Mbps it would take ~1.6 s to push the data, while the engine 
computes the FFT in ~200 ms.  10 Gb Ethernet would be nice, but I don't 
have the budget for it.  Even then, the transport would take ~0.16 s 
vs the 0.2 s compute.
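
For anyone who wants to check the arithmetic, here is a rough sketch.  
It assumes 16 bytes per complex double, no framing or protocol 
overhead, and (my assumption, to match the 0.16 s figure) about 8 Gbps 
effective on a 10 Gb link.  The function name is made up for 
illustration.

```python
# Back-of-the-envelope transfer time for one FFT data set.
BYTES_PER_COMPLEX_DOUBLE = 16  # two 8-byte doubles (real + imaginary)


def transfer_seconds(points, link_bits_per_second):
    """Seconds to move `points` complex doubles, ignoring all overhead."""
    bits = points * BYTES_PER_COMPLEX_DOUBLE * 8
    return bits / link_bits_per_second


POINTS = 10_000_000      # ~10M point FFT, from the numbers above
COMPUTE_SECONDS = 0.2    # ~200 ms FFT compute time, from the numbers above

# The 8 Gbps entry is an assumed effective rate for 10 Gb Ethernet.
for label, rate in [("800 Mbps", 800e6), ("8 Gbps (assumed)", 8e9)]:
    t = transfer_seconds(POINTS, rate)
    print(f"{label}: {t:.2f} s transfer vs {COMPUTE_SECONDS:.2f} s compute")
```

Either way the link time is comparable to or larger than the compute 
time, so the transfer is the part to attack.
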
>     Did you explore adding a dedicated FFT card to your control 
> computer?   The algorithms they build into the hardware are much, much 
> faster than compiled software.  The local board would keep the data in 
> your control computer - with DMA, I assume - eliminating the transfer 
> problem.
>
>   
I will look into it again.  Maybe the landscape has changed.  At one 
point I had to do 128M point FFTs - there wasn't any hardware to do that!

>     I know a fellow who now works for Apple whose job is to optimize FFT 
> algorithms to the processor they use.  Assembly language, of course.  
> Why is Apple interested?  Faster FFT, faster MP3 translation, longer 
> battery life.  A very high payoff.
>
> Jim Kuzdrall 
>
>
>   
I am using open source FFTW.  It is quite fast and it uses the 
platform's assets quite effectively.  Fortunately, it has been optimized 
for the Cell processor.  It runs 50-100x faster on my Cell than on my 
3.4 GHz P4, or whatever boat anchor I have.  I also tested the software 
on a couple of our servers.  The ratio is still way up near 50x.  The 
problem on the P4 / AMD64 class machines is that the cache gets 
exhausted and then the memory bus bandwidth gets saturated; this forms 
the upper limit of their performance.
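
To give a feel for how badly the data overflows the cache, here is a 
rough sketch.  The 2 MiB cache size is a hypothetical figure chosen 
only for illustration, not a measurement of any of the machines 
mentioned, and the function names are made up.

```python
# Compare the FFT working set against an assumed cache size.
BYTES_PER_COMPLEX_DOUBLE = 16          # two 8-byte doubles per point
ASSUMED_CACHE_BYTES = 2 * 1024 * 1024  # hypothetical 2 MiB last-level cache


def working_set_bytes(points):
    """Bytes needed to hold one array of `points` complex doubles."""
    return points * BYTES_PER_COMPLEX_DOUBLE


def cache_overflow_factor(points, cache_bytes=ASSUMED_CACHE_BYTES):
    """How many times larger the array is than the assumed cache."""
    return working_set_bytes(points) / cache_bytes


points = 10_000_000  # ~10M point FFT
print(f"array size: {working_set_bytes(points) / 1e6:.0f} MB")
print(f"times larger than assumed cache: {cache_overflow_factor(points):.0f}x")
```

With the array tens of times larger than the cache, most FFT passes 
have to stream data over the memory bus, which is consistent with the 
bandwidth ceiling described above.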

The problem is indeed quite challenging.  I've gone down quite a few 
dead ends.  The list has seen some of my dead-end attempts, but not all 
of them :)  I spared you some...

Bruce