Re: Program speed execution question




"Richard E Maine" <nospam@xxxxxxxxxxxxx> wrote in message
news:nospam-0F7BD0.12463223042005@xxxxxxxxxxxxxxxxxxxxx
> In article <v61l611sl8b9p3v6djo4gv6uu3nhjif697@xxxxxxx>,
> Joe Hill <georgecostanz50@xxxxxxxxxxx> wrote:
>
>> We have a program that we are running on both Xeon 32 bit and Opteron
>> 64 bit cpus. The program runs much faster on the 32 bit Xeon processors.
>> The run time (wall clock) is as follows :
>>
>> Xeon 32 bit = .017 wall clock-hours
>> Opteron 64 bit = .309 wall clock-hours
>
> Some difference could be explained several ways, but that's a pretty
> darned big difference for any of the explanations. You say you were
> using the same compiler for both (or anyway, that's how I interpreted
> what you said), but maybe it is just the same version number. Anyway, I
> can't explain that part.
>
>> The internal customer then examined the code and changed the way arrays
>> are allocated.
>>
>> Old Way : [pointers]
>> New Way : [allocatables]
>
>> Changing the array allocation decreased the wall clock time to almost
>> nothing on both types of cpus according to our internal customer.
>> Can anyone explain
>
> Others have talked about aliasing, but I'd guess that to be the wrong
> explanation here. Aliasing can be important, but I wouldn't expect to
> see changes quite as big as you describe except possibly in the most
> contrived special cases. However...
>
> I have personally seen *HUGE* differences between allocatable and
> pointer arrays because allocatables are known at compile time to be
> contiguous, whereas pointers are not. In some compilers, this causes
> unnecessary copy-in/copy-out operations. That can result in performance
> penalties that are almost arbitrarily large when huge arrays get copied
> around just to perform trivial operations on single elements.

Aliasing could account for as much as a factor of 5 in performance on the
Xeon, if it makes the difference between the compiler vectorizing a loop or
not. The difference would be smaller on the Opteron, but still significant,
for single precision. A larger factor might come about if temporary arrays
were being allocated inside an inner loop and the new declaration lets the
optimizer eliminate them.
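
To make the two effects discussed above concrete, here is a minimal sketch,
not the original poster's code. All names (demo_mod, bump_first, work_p,
work_a, x_p, y_p, x_a, y_a) and the array size are made up for illustration,
and whether a given compiler actually copies or declines to vectorize depends
on the compiler and its options.

```fortran
module demo_mod
  implicit none
contains
  ! Explicit-shape dummy argument: the callee expects contiguous storage.
  subroutine bump_first(a, n)
    integer, intent(in) :: n
    real, intent(inout) :: a(n)
    a(1) = a(1) + 1.0          ! trivial work on a single element
  end subroutine bump_first
end module demo_mod

program pointer_vs_allocatable
  use demo_mod
  implicit none
  integer, parameter :: n = 1000000   ! hypothetical size
  real, pointer     :: work_p(:), x_p(:), y_p(:)
  real, allocatable :: work_a(:), x_a(:), y_a(:)
  integer :: i

  allocate(work_p(n), x_p(n), y_p(n))
  allocate(work_a(n), x_a(n), y_a(n))
  work_p = 0.0; x_p = 1.0; y_p = 0.0
  work_a = 0.0; x_a = 1.0; y_a = 0.0

  do i = 1, 100
    ! Pointer actual argument: contiguity is not known at compile time,
    ! so some compilers copy all n elements in and out around each call.
    call bump_first(work_p, n)
    ! Allocatable actual argument: known to be contiguous, so it can be
    ! passed by address without a temporary.
    call bump_first(work_a, n)
  end do

  ! Aliasing: x_p and y_p could point at overlapping storage, so the
  ! compiler may have to be conservative about vectorizing this loop.
  do i = 1, n
    y_p(i) = y_p(i) + 2.0 * x_p(i)
  end do
  ! Distinct allocatables cannot overlap, so this loop is free to vectorize.
  do i = 1, n
    y_a(i) = y_a(i) + 2.0 * x_a(i)
  end do

  print *, work_p(1), work_a(1), y_p(n), y_a(n)
  deallocate(work_p, x_p, y_p, work_a, x_a, y_a)
end program pointer_vs_allocatable
```

Both versions compute the same values; only the cost can differ. Timing the
two calls and the two loops separately on each machine would show which
effect, if either, dominates with a particular compiler.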

