Re: omp question



George Trojan wrote:
Gib Bogle wrote:

George Trojan wrote:

What is it that I do not understand? The program listed below produces these execution times:

wx22gt> ./try 40000000
num threads = 32
do loop: 0.5078125000E-01
where + array: 1.023437500
STOP 0

but, when I comment out the omp statements, the "where" statement is faster:

wx22gt> ./try 40000000
do loop: 0.3007812500
where + array: 0.8085937500
STOP 0

The operating system is AIX 5.3; the compile/link commands are
xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp -c try.f90
xlf95_r -g -O0 -qfree=f90 -qsuffix=f=f90 -qnosave -qsmp=omp -o try try.o

I just started experimenting with OpenMP, so I must be missing something.

George

module try_internal

  implicit none

contains

  function get_time()
    real :: get_time

    integer :: date_time(8)

    call date_and_time(values=date_time)
    get_time = date_time(5)*3600 + date_time(6)*60 + date_time(7) + &
               date_time(8)*1.0e-3
  end function get_time

end module try_internal

program try
  use omp_lib
  use try_internal
  implicit none

  real, dimension(:), allocatable :: a
  character(len=256) :: buf
  integer :: n, i, k
  real :: t
  real, parameter :: eps = 0.2

  call getarg(1, buf)
  read(buf, *) n
  allocate(a(n))
  call random_number(a)
  t = get_time()
  !$omp parallel
  if (omp_get_thread_num() == 1) &
    print *, 'num threads = ', omp_get_num_threads()
  !$omp do
  do i = 1, n
    if (a(i) < eps) a(i) = a(i) + 1.0
  end do
  !$omp end do
  !$omp end parallel
  print *, 'do loop: ', get_time() - t
  call random_number(a)
  t = get_time()
  !$omp parallel workshare
  where (a < eps) a = a + 1.0
  !$omp end parallel workshare
  print *, 'where + array: ', get_time() - t

  stop 0

end program try


Using more processors doesn't necessarily speed up a program. You have to think carefully about caching issues. I don't know the correct terminology, but the basic idea is that when a CPU accesses memory it transfers a chunk into a cache, and this chunk has a size that may exceed the size of the element you are changing. This speeds up processing (on average) because you often want to perform repeated operations on the same memory location or on adjacent locations. This is a complex subject of which I have only an inkling:
http://en.wikipedia.org/wiki/CPU_cache
but I do know that this issue can have a serious impact on shared-memory multiprocessing. The reason is that a CPU running one thread often has to wait for a CPU running another thread to release a chunk of memory when the two threads are operating on adjacent memory locations (e.g. on elements within the same array). This can happen whenever the cache chunk is bigger than the array element (in your case 4 bytes).

I encountered this unpleasant surprise in my OpenMP coding, and have developed coding procedures to minimize the slowdown. If I can't avoid having threads trying to access adjacent elements in an array a significant fraction of the time, I resort to the crude but effective expedient of padding out the array elements so that they are at least a cache-chunk apart. For example, if the cache-chunk is M bytes, instead of a(i) I use a((M/4)*i), where the dimension of a is now (M/4)*n.

I suggest you experiment with this, and explore the effect on timing of different values of M.
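
A minimal sketch of the padding idea described above, for experimenting with different values of M. The "cache-chunk" is what is usually called a cache line. The value M = 128 bytes, the array length n, and the program name are assumptions chosen for illustration; they are not taken from the post, and the right M depends on the machine.

```fortran
program padded_stride
   use omp_lib
   implicit none

   ! ASSUMPTION: cache-chunk (cache line) size in bytes. Try 4, 32, 64, 128, ...
   integer, parameter :: m_bytes = 128
   ! A default real is assumed to occupy 4 bytes, as stated in the post.
   integer, parameter :: stride = m_bytes / 4
   ! Hypothetical problem size: number of logical elements.
   integer, parameter :: n = 400000
   real, parameter :: eps = 0.2

   real, allocatable :: a(:)
   integer :: i
   double precision :: t0, t1

   ! The array is (M/4)*n long; only every stride-th element is used.
   allocate(a(stride*n))
   call random_number(a)          ! called outside the parallel region

   t0 = omp_get_wtime()
   !$omp parallel do
   do i = 1, n
      ! Logical element i lives at index (M/4)*i; the rest is padding.
      if (a(stride*i) < eps) a(stride*i) = a(stride*i) + 1.0
   end do
   !$omp end parallel do
   t1 = omp_get_wtime()

   print *, 'M = ', m_bytes, ' stride = ', stride, ' time = ', t1 - t0
   deallocate(a)
end program padded_stride
```

Setting m_bytes to 4 gives a stride of 1, i.e. the unpadded layout, which serves as the baseline. Padding costs memory and memory bandwidth, so it is most likely to pay off when different threads update neighbouring elements; if each thread works on its own contiguous block of the array, neighbouring elements are shared only at the block boundaries.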

By the way, random_number() in a parallel loop is a real trap for young players. The reason is that the random number generator maintains a global seed value - the state of the RNG. This value is accessed and changed by all the threads, leading to contention. Since I do a lot of Monte Carlo simulations I wrote my own parallel RNG, in which each thread generates a random sequence independent of the other threads (each has its own seed). When I get to work I'll send you a document I wrote on this subject, if you are interested and if you provide your email address.
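
To illustrate the per-thread seed idea, here is a minimal sketch in which each thread advances its own generator state, so no state is shared between threads. This is not the generator mentioned above. It uses the well-known Park-Miller "minimal standard" recurrence, and the module name, the seeding rule, and the padding width are all hypothetical. Giving each thread a different seed does not by itself guarantee statistically independent streams, so treat this as a starting point only.

```fortran
module thread_rng
   implicit none
   private
   public :: rng_init, rng_uniform

   integer, parameter :: i8 = selected_int_kind(18)
   integer, parameter :: dp = kind(1.0d0)
   integer(i8), parameter :: a_mult  = 16807_i8
   integer(i8), parameter :: modulus = 2147483647_i8   ! 2**31 - 1
   ! ASSUMPTION: 16 eight-byte words (128 bytes) between thread states,
   ! so that two states do not sit in the same cache chunk.
   integer, parameter :: pad = 16
   integer(i8), allocatable :: state(:,:)

contains

   subroutine rng_init(nthreads, base_seed)
      integer, intent(in) :: nthreads, base_seed
      integer :: t
      if (allocated(state)) deallocate(state)
      allocate(state(pad, 0:nthreads-1))
      state = 0_i8
      do t = 0, nthreads - 1
         ! Hypothetical seeding rule: a distinct seed in 1 .. modulus-1.
         state(1, t) = modulo(int(base_seed, i8) + 7919_i8*t, modulus - 1_i8) + 1_i8
      end do
   end subroutine rng_init

   function rng_uniform(tid) result(u)
      integer, intent(in) :: tid      ! value of omp_get_thread_num()
      real(dp) :: u
      ! 16807 * (2**31 - 2) fits easily in a 64-bit integer.
      state(1, tid) = mod(a_mult * state(1, tid), modulus)
      u = real(state(1, tid), dp) / real(modulus, dp)   ! strictly in (0,1)
   end function rng_uniform

end module thread_rng

program rng_demo
   use omp_lib
   use thread_rng
   implicit none
   integer, parameter :: n = 1000000
   integer :: i, tid
   double precision :: total

   call rng_init(omp_get_max_threads(), 12345)
   total = 0.0d0
   !$omp parallel private(tid) reduction(+:total)
   tid = omp_get_thread_num()
   !$omp do
   do i = 1, n
      total = total + rng_uniform(tid)   ! touches only this thread's state
   end do
   !$omp end do
   !$omp end parallel
   print *, 'mean of samples = ', total / n
end program rng_demo
```

The printed mean should be close to 0.5 if the generator behaves as intended. Because each thread reads and writes only its own column of state, there is no contention on a single global seed.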

Gib


I understand this. My point was (or at least the one I wanted to make) that the do loop ran about 6 times faster (0.05s vs 0.3s) with OMP, while the equivalent where statement was a bit slower. It looks as if the OMP overhead was incurred, but the implied loop in the where statement and the array arithmetic was not executed in parallel.
The call to random_number() is outside the parallel region, I think.
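
A side note on the measurements themselves: every time printed above is a multiple of 1/256 s (for example 0.05078125 = 13/256), which is consistent with get_time() holding seconds since midnight in a default (single precision) real. The sketch below times the same where statement with omp_get_wtime(), which returns a double precision wall-clock time. The program name and the array size are assumptions for illustration.

```fortran
program wtime_demo
   use omp_lib
   implicit none

   ! Hypothetical array size; the post used 40000000.
   integer, parameter :: n = 10000000
   real, parameter :: eps = 0.2
   real, allocatable :: a(:)
   double precision :: t0, t1

   allocate(a(n))
   call random_number(a)          ! outside any parallel region

   t0 = omp_get_wtime()
   !$omp parallel workshare
   where (a < eps) a = a + 1.0
   !$omp end parallel workshare
   t1 = omp_get_wtime()

   print *, 'where + array: ', t1 - t0, ' s'
   ! omp_get_wtick() reports the resolution of the timer used above.
   print *, 'timer tick:    ', omp_get_wtick(), ' s'
   deallocate(a)
end program wtime_demo
```

A finer timer does not change the conclusion drawn from 0.05s vs 0.3s, but it makes smaller differences between the do loop and the workshare version measurable.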

Sorry for going off on a tangent :-)