1. Multiplication of a matrix and its transpose: Consider matrix $A$ of size $N$ by $M$ and its transpose $A^T$ of size $M$ by $N$. Your task is to design and implement a parallel algorithm for multiplication of a matrix and its transpose, i.e., $C = AA^T$, for distributed-memory multi-computers in which the processors are organized as a one-dimensional linear array.
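Written element-wise, the product reduces to dot products between rows of $A$. The formula below assumes zero-based indices (matching the c00 style used for the output in this problem) and writes $a_{ik}$ for the entry of $A$ in row $i$, column $k$; this notation is introduced here for illustration only.

```latex
c_{ij} \;=\; \sum_{k=0}^{M-1} a_{ik}\,\bigl(A^{T}\bigr)_{kj}
       \;=\; \sum_{k=0}^{M-1} a_{ik}\,a_{jk},
\qquad 0 \le i,\, j \le N-1 .
```

Two consequences follow directly: $C$ is $N$ by $N$, and $A^T$ never has to be stored explicitly, because each $c_{ij}$ needs only rows $i$ and $j$ of $A$.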
In the parallel algorithm design you must consider efficiency issues, i.e., try to minimize computation and communication costs and balance the workloads among all processors. Since the resulting matrix $C$ is symmetric, i.e., $c_{ij} = c_{ji}$, your algorithm needs to calculate only the elements in the upper (or lower) triangular part of the matrix. (In other words, you must not calculate both $c_{ij}$ and $c_{ji}$ for $i \ne j$, as they are the same.)
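One possible way to balance the triangular workload, sketched here under several assumptions: the rows of A are split into p contiguous blocks with block r initially on process r, p is odd, and the linear array is treated as a ring (a strict linear array has no wraparound link). At step t, process r multiplies its own row block by block (r + t) mod p, for t = 0 up to (p - 1)/2. The simulation below only checks that this schedule computes every unordered block pair exactly once; the helper names are hypothetical and no MPI calls are made. In an MPI implementation, the block needed for step t + 1 could be passed along the ring with non-blocking calls such as MPI_Isend and MPI_Irecv while the product for step t is being computed.

```c
#include <assert.h>
#include <stdlib.h>

/* Row block that process `rank` pairs with its own block at ring step `t`
 * (hypothetical helper, not part of the assignment). */
int partner_block(int p, int rank, int t)
{
    assert(p > 0 && rank >= 0 && rank < p && t >= 0);
    return (rank + t) % p;
}

/* Number of ring steps after the diagonal step t = 0; assumes p is odd. */
int ring_steps(int p)
{
    assert(p > 0 && p % 2 == 1);
    return (p - 1) / 2;
}

/* Simulate the schedule for p processes. Returns 1 if every unordered
 * block pair {r, s} with r <= s is computed exactly once, 0 otherwise. */
int schedule_covers_once(int p)
{
    int *hits = calloc((size_t)p * (size_t)p, sizeof *hits);
    int ok = 1;
    if (hits == NULL)
        return 0;
    for (int rank = 0; rank < p; rank++) {
        for (int t = 0; t <= ring_steps(p); t++) {
            int s = partner_block(p, rank, t);
            int lo = rank < s ? rank : s;   /* map the pair to the */
            int hi = rank < s ? s : rank;   /* upper triangle      */
            hits[lo * p + hi]++;
        }
    }
    for (int r = 0; r < p; r++)
        for (int s = r; s < p; s++)
            if (hits[r * p + s] != 1)
                ok = 0;
    free(hits);
    return ok;
}
```

Under this schedule every process handles one diagonal block (of which only the triangle is needed) and (p - 1)/2 full off-diagonal blocks, so the work is even when the row blocks have equal size. With an even p, the pairs at ring distance p/2 would be reached from both ends, so they need separate treatment.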
In the implementation
a. You must use MPI non-blocking send/recv communication functions to overlap computation and communication.
b. You can assume $N \ge p$, where $p$ is the number of processes organized as a one-dimensional linear array.
c. Your program must produce correct results for $p$ greater than or equal to one.
d. For simplicity you may restrict $p$ to be either an odd or an even number to achieve the best possible load balancing.
e. Your program needs to ask for the matrix sizes $N$ and $M$ as user-defined parameters, and must print out the results in row-wise order as shown in the example below.
c00 c01 c02 c03 c04
c11 c12 c13 c14
c22 c23 c24
c33 c34
c44
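The row-wise layout above maps naturally onto a packed array that stores only the upper triangle. The sketch below assumes double-precision entries, a row-major packed array of N(N+1)/2 values, and space-separated "%g" output; none of these are specified by the assignment, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Offset of c[i][j] (with i <= j) in a packed row-major upper triangle of
 * an n-by-n matrix. Rows 0..i-1 hold n, n-1, ..., n-i+1 entries, which
 * sums to i*(2n - i + 1)/2. */
size_t packed_index(size_t n, size_t i, size_t j)
{
    assert(i <= j && j < n);
    return i * (2 * n - i + 1) / 2 + (j - i);
}

/* Print the packed upper triangle one matrix row per line:
 * line i holds c[i][i], c[i][i+1], ..., c[i][n-1]. */
void print_upper(FILE *out, const double *c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i; j < n; j++)
            fprintf(out, "%s%g", j == i ? "" : " ",
                    c[packed_index(n, i, j)]);
        fputc('\n', out);
    }
}
```

For N = 5 the packed array has 15 entries, with c00 at offset 0 and c44 at offset 14, so printing it with print_upper(stdout, c, 5) reproduces the five-line shape of the example.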
After the parallel computation, your main program must perform a self-check, i.e., first perform a sequential computation using the same data set and then compare the two results.
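The self-check could look like the following minimal sketch. It assumes A is stored row-major as doubles and that both results use a packed row-major upper triangle of N(N+1)/2 entries; the function names and the relative-tolerance comparison are illustrative choices, not requirements of the assignment. A tolerance is used because floating-point results may differ in the last bits if the parallel code accumulates in a different order; with integer data an exact comparison would do.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference: fill the packed upper triangle of C = A * A^T.
 * `a` is n-by-m, row-major; `c` must have room for n*(n+1)/2 doubles. */
void seq_aat_upper(const double *a, size_t n, size_t m, double *c)
{
    size_t idx = 0;
    assert(a != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i; j < n; j++) {      /* only j >= i is computed */
            double sum = 0.0;
            for (size_t k = 0; k < m; k++)
                sum += a[i * m + k] * a[j * m + k];  /* row i . row j */
            c[idx++] = sum;
        }
    }
}

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* Return 1 if every entry of `par` is within `tol` of the matching entry
 * of `seq`, relative to max(1, |seq[i]|); return 0 otherwise. */
int results_match(const double *par, const double *seq, size_t len,
                  double tol)
{
    for (size_t i = 0; i < len; i++) {
        double scale = abs_d(seq[i]) > 1.0 ? abs_d(seq[i]) : 1.0;
        if (abs_d(par[i] - seq[i]) > tol * scale)
            return 0;
    }
    return 1;
}
```

In a full program, the process that gathers the parallel result would call seq_aat_upper on the same input and report the outcome of results_match alongside the printed matrix.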