How can load imbalance be addressed with OpenMP?
How can OpenMP improve the scaling of an all-MPI
application?
Many applications cannot always be written to access the innermost dimension in the inner DO loop. For example, consider an
application that performs FFTs across the vertical lines of a grid
and then performs the transform across the horizontal lines of
a grid. In the FFT example, code developers may use a transpose to reorder the data for better performance when doing the
FFT. The extra work introduced by the transpose is only worth
the effort if the FFT is large enough to benefit from the ability
to work on local contiguous data. Using a library FFT call that
allows strides in the data, measure the performance of the two
approaches, (a) not using a transpose and (b) using a transpose,
and determine the size of the FFT where (b) beats (a).
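
One possible shape for such a measurement is sketched below in Python, using only the standard library. It is a minimal sketch under stated assumptions, not a reference solution: `fft_strided` is a hand-written radix-2 FFT standing in for the strided library call the exercise asks for, and every function name here is hypothetical. The grid is assumed to be square, stored row-major in a flat list, with a power-of-two size. In pure Python the timings mostly reflect interpreter overhead rather than cache behavior, so the crossover point has to be measured with a compiled library FFT on the target machine; the sketch only shows how the two code paths are organized and checks that they agree.

```python
import cmath
import time


def fft_strided(data, start, stride, n):
    """Radix-2 FFT of the n elements data[start], data[start + stride], ...

    Stand-in for a library FFT that accepts a stride (assumption).
    n must be a power of two. Returns a new list of n complex values.
    """
    if n == 1:
        return [complex(data[start])]
    # Even-indexed and odd-indexed samples, each reached with doubled stride.
    even = fft_strided(data, start, 2 * stride, n // 2)
    odd = fft_strided(data, start + stride, 2 * stride, n // 2)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(grid, n):
    """Return the transpose of an n x n row-major flat list."""
    return [grid[r * n + c] for c in range(n) for r in range(n)]


def column_ffts_strided(grid, n):
    """Approach (a): FFT each column where it lies, with stride n."""
    return [fft_strided(grid, c, n, n) for c in range(n)]


def column_ffts_transposed(grid, n):
    """Approach (b): transpose first, then FFT contiguous rows (stride 1)."""
    t = transpose(grid, n)
    return [fft_strided(t, c * n, 1, n) for c in range(n)]


def time_it(fn, grid, n, repeats=3):
    """Best wall-clock time of fn(grid, n) over a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(grid, n)
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    # Sweep the FFT size and report both timings for each size.
    for n in (8, 16, 32, 64):
        grid = [float((i * 7) % 13) for i in range(n * n)]
        ta = time_it(column_ffts_strided, grid, n)
        tb = time_it(column_ffts_transposed, grid, n)
        print(f"n={n:4d}  strided={ta:.6f}s  transposed={tb:.6f}s")
```

The time for approach (b) deliberately includes the transpose, since that is the extra work being weighed against the benefit of contiguous access. To carry out the exercise proper, the two `column_ffts_*` bodies would be replaced by calls to the chosen library's strided and unit-stride transforms, and the sweep extended to the sizes of interest.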