Parallel Matrix Multiply
------------------------
This directory should contain the following files:

  README     - This file
  buildf77   - Script to build auto-parallel version of matmul.F using PGF77
  buildf77mp - Script to build OpenMP version of matmul.F using PGF77
  buildf90   - Script to build auto-parallel version of matmul.F using PGF90
  buildf90mp - Script to build OpenMP version of matmul.F using PGF90
  buildhpf   - Script to build HPF version of matmul.F using PGHPF
  matmul.F   - Parallel matrix multiplication timing program

The file matmul.F contains a matrix multiply timer coded so that it can be
compiled and executed in parallel using auto-parallelization, OpenMP
directive-based parallelization, or HPF data parallelism.

Running on Linux or Solaris86
-----------------------------
To build and run this benchmark on Linux or Solaris86, your environment
must be properly initialized to use the PGI compilers.  If your environment
is not yet initialized, and assuming the PGI compilers are installed in
the directory /usr/local/pgi, issue the following commands (csh syntax):

	% setenv PGI /usr/local/pgi
	% set path=($PGI/<OS>/bin $path)
	% setenv MANPATH "$MANPATH":$PGI/man

where <OS> is replaced by either "linux86" or "solaris86" (without the
quotation marks). 
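
If you use bash, sh, or ksh rather than csh, the equivalent initialization
is sketched below. As above, this assumes the compilers are installed in
/usr/local/pgi and uses "linux86" as the <OS> directory; substitute
"solaris86" as appropriate:

```shell
# bash/sh/ksh equivalent of the csh initialization above.
# Assumes an installation under /usr/local/pgi with <OS> = linux86.
export PGI=/usr/local/pgi
export PATH=$PGI/linux86/bin:$PATH
export MANPATH=${MANPATH:-}:$PGI/man
```

The ${MANPATH:-} form avoids an "unbound variable" error if MANPATH was
not previously set.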

Auto-parallelization - Once your environment is initialized, issue the 
following commands to build and run the auto-parallelized matrix multiply 
using PGF77:

	% buildf77 

	< lots of output to your screen >

	% matmul_f77

	< timing output to your screen >

By default, the "matmul_f77" executable will use only one processor. 
You can run on 2 or more processors by setting the NCPUS environment 
variable.  If you're using csh, set NCPUS as follows:

	% setenv NCPUS 2

If using bash, sh, or ksh, set NCPUS as follows:

	% NCPUS=2 ; export NCPUS

and then execute "matmul_f77" again to see what kind of speedups you get.
Speedups will vary depending on the type of system you're using, but should
be anywhere from 50% to 95% (that is, 1.5 to 1.95 times faster) when running
on 2 processors versus 1.
This same sequence of commands should also work using the "buildf90" 
script, which compiles using PGF90 rather than PGF77.
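
If you want to turn two of the reported wall-clock times into a percent
speedup figure like the one quoted above, the arithmetic is sketched below.
The times used here are made-up example values, not measured output from
matmul_f77:

```shell
# Percent speedup from two hypothetical wall-clock times (seconds).
T1=10.0   # elapsed time on 1 processor (example value)
T2=5.5    # elapsed time on 2 processors (example value)
awk -v t1="$T1" -v t2="$T2" \
    'BEGIN { printf "%.0f%% speedup\n", (t1/t2 - 1) * 100 }'
```

With these example times the script prints "82% speedup", i.e. the 2-CPU
run is about 1.8 times faster than the 1-CPU run.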

OpenMP parallelization - Once your environment is initialized, issue the 
following commands to build and run the OpenMP-parallelized matrix multiply 
using PGF77:

	% buildf77mp

	< lots of output to your screen >

	% matmul_f77mp

	< timing output to your screen >

By default, the "matmul_f77mp" executable will use only one processor. 
You can run on 2 or more processors by setting the OMP_NUM_THREADS 
environment variable.  If you're using csh, set OMP_NUM_THREADS as 
follows:

	% setenv OMP_NUM_THREADS 2

If using bash, sh, or ksh, set OMP_NUM_THREADS as follows:

	% OMP_NUM_THREADS=2 ; export OMP_NUM_THREADS

and then execute "matmul_f77mp" again to see what kind of speedups you get.
Speedups will vary depending on the type of system you're using, but
should be anywhere from 50% to 95% when running on 2 processors over 1.
This same sequence of commands should also work using the "buildf90mp" 
script, which compiles using PGF90 rather than PGF77.

HPF parallelization - Once your environment is initialized, issue the 
following commands to build and run the data parallel HPF matrix multiply 
using PGHPF:

	% buildhpf

	< lots of output to your screen >

	% matmul_hpf

	< timing output to your screen >

By default, the "matmul_hpf" executable will use only one processor. 
You can run on 2 or more processors by using the "-pghpf -np" runtime
option:

	% matmul_hpf -pghpf -np 2

	< timing output to your screen >


See the PGHPF User's Guide for details on how to run on multiple processors
across a distributed-memory cluster.  Speedups will vary depending on the 
type of system you're using, but should be anywhere from 50% to 95% when 
running on 2 processors over 1.


Running on Windows NT
---------------------
To build and run this benchmark on Windows NT, you need to be
working within a PGI Workstation BASH command window (that's the 
window that comes up when you double-click the PGI icon on your
Windows NT desktop). 

Auto-parallelization - issue the following commands to build and run 
the auto-parallelized matrix multiply using PGF77:

	% bash buildf77 

	< lots of output to your screen >

	% matmul_f77

	< timing output to your screen >

By default, the "matmul_f77" executable will use only one processor. 
You can run on 2 or more processors by setting the NCPUS environment 
variable:

	% NCPUS=2 ; export NCPUS

and then execute "matmul_f77" again to see what kind of speedups you get.
Speedups will vary depending on the type of system you're using, but
should be anywhere from 50% to 95% when running on 2 processors over 1.
This same sequence of commands should also work using the "buildf90" 
script, which compiles using PGF90 rather than PGF77.

OpenMP parallelization - issue the following commands to build and run 
the OpenMP-parallelized matrix multiply using PGF77:

	% bash buildf77mp

	< lots of output to your screen >

	% matmul_f77mp

	< timing output to your screen >

By default, the "matmul_f77mp" executable will use only one processor. 
You can run on 2 or more processors by setting the OMP_NUM_THREADS 
environment variable:

	% OMP_NUM_THREADS=2 ; export OMP_NUM_THREADS

and then execute "matmul_f77mp" again to see what kind of speedups you get.
Speedups will vary depending on the type of system you're using, but
should be anywhere from 50% to 95% when running on 2 processors over 1.
This same sequence of commands should also work using the "buildf90mp" 
script, which compiles using PGF90 rather than PGF77.

HPF parallelization - issue the following commands to build and run 
the data parallel HPF matrix multiply using PGHPF:

	% bash buildhpf

	< lots of output to your screen >

	% matmul_hpf

	< timing output to your screen >

By default, the "matmul_hpf" executable will use only one processor. 
You can run on 2 or more processors by using the "-pghpf -np" runtime
option:

	% matmul_hpf -pghpf -np 2

	< timing output to your screen >


See the PGHPF User's Guide for details on how to run on multiple processors
across a distributed-memory cluster.  Speedups will vary depending on the 
type of system you're using, but should be anywhere from 50% to 95% when 
running on 2 processors over 1.
