
Parallelizing the NAS FT Benchmark:  A Tutorial
Introduction to the Usage of PGI Fortran and OpenMP

Keywords:  Fortran, Parallel, OpenMP, Linux, Solaris86, Windows NT

System Requirements:  Pentium Pro/II workstation or server with 2 or
more processors and a minimum of 64 MB of memory running Linux 2.0.3 or
higher, Solaris86 2.5 or higher, or Windows NT 4.0 or higher.  This
tutorial assumes the PGI Workstation compilers and tools are already
installed on the system.  See the PGI Workstation installation notes
if you need to install the PGI Workstation software prior to beginning
the tutorial.  The tutorial was developed on a Linux system with dual
150 MHz Pentium Pro processors.  On 1 processor of this system, the entire
example program, prior to optimization and parallelization, executes in
just under 1 minute.

This tutorial is based on a serial version of the NAS FT benchmark,
one of the 8 NAS Parallel Benchmarks defined by the NAS Division at NASA
Ames Research Center.  By following the 6 steps outlined below, you will
progressively parallelize the program "fftpde.F" using the OpenMP parallel
capabilities of the PGI Fortran compiler(s), inserting directives where
necessary.  Upon completing this tutorial, you will have a basic working 
knowledge of how to use a PGI Fortran compiler (PGF77 or PGF90), how to use 
the PGPROF graphical profiler for performance tuning, how to parallelize 
a Fortran program, and how to execute a program in parallel on a 
multi-processor (SMP) workstation or server.

Basic knowledge of Fortran 77 programming is assumed.  The example 
command-lines below assume usage of the UNIX "vi" editor to modify the
file "fftpde.F", but any available file editor may be used.  If you are 
not familiar with "vi" or "emacs", you may want to use "pico" on Linux or 
"edit" on Windows NT.  They are less powerful but more intuitive.  For
an experienced Fortran programmer who is familiar with UNIX command-level 
compilers and tools, the tutorial takes 2-4 hours to complete, depending
on how much you try to do on your own without looking at the completed
examples.

You should find the following files in this directory (hereafter referred
to as the "working directory"):

	README		This File
	STEPS		A directory containing examples of how fftpde.F
			should look after each step is completed
	fftpde.F	The working copy of fftpde.F
	results		A file containing the output from a run of fftpde.F
			which gets correct results -- used for checking
			answers as each step is completed

If any of these files is missing or others are present, you have somehow 
received a modified or previously used version of the tutorial.  You should 
retrieve an unmodified copy to ensure successful completion of the tutorial.  
You can find this tutorial on the PGI Web site at 

	ftp://ftp.pgroup.com/pub/SMP/fftpde.tar.gz

In the steps that follow, "% " is used to represent the UNIX shell command
prompt (or PGI Workstation for Windows NT command prompt). This README file
assumes you are working in a "csh" command shell.

*******
STEP 1:  Setting up your environment to use the PGI Workstation compilers
*******

Linux or Solaris
----------------
If you are working on a Linux or Solaris system, you must know where the 
PGI Workstation compilers are installed in order to initialize your
environment properly.  Common choices for the installation are "/usr/pgi" 
or "/usr/local/pgi".  If you don't know where PGI Workstation is installed
on your system, ask your system administrator or the person who installed 
PGI Workstation for you.  Assuming the installation is in /usr/local/pgi,
execute the following commands to initialize your environment:

	% setenv PGI /usr/local/pgi
	% set path=($PGI/linux86/bin $path)
	% setenv MANPATH "$MANPATH":$PGI/man

(On a Solaris system, use $PGI/solaris86/bin in place of $PGI/linux86/bin.
If MANPATH is not already set in your environment, omit the "$MANPATH":
portion of the last command.)

In addition, you may want to bring up the online HTML PGI Workstation
manuals by opening the directory /usr/local/pgi/doc in a web browser.

Windows NT
----------
If you are working on a Windows NT system with a properly installed version
of PGI Workstation, double-click the PGI Workstation icon on your desktop
with the left mouse button.  A command window should appear on your desktop.
The environment in the command window will be pre-initialized for usage of
the PGI Workstation compilers and tools.  Depending on how PGI Workstation
was installed, the command window will either be a UNIX-like command
shell (bash) or a Microsoft command window.  If you wish to use a command
window other than the default, select "Start -> Programs -> PGI Workstation"
with the left mouse button to choose an appropriate command window.

*******
STEP 2:  Compile and execute the original serial version of fftpde.F.
*******

Use the following commands to make a copy of the original serial version
of fftpde.F in the working directory, compile it using the PGF77 Fortran 77
compiler, execute it on your system, and check the results.

	% cp STEPS/fftpde.F fftpde.F
	% pgf77 fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

The file "fft.out" is created and written by "fftpde".  It contains results
of the error checking and execution time in CPU seconds.  You should see no
differences in the numerical results from run to run, but your CPU times
will vary depending on the system you are using, the number of processors
you use, and the version of the program you are compiling.  To see the
whole "fft.out" file, type the following:

	% cat fft.out

Now re-compile and re-execute fftpde.F using basic optimization with the
"-fast" compiler option:  

	% pgf77 -fast fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

"fftpde" should execute a little faster this time.  The "-fast" option is 
equivalent to specifying the options "-O2 -Munroll -Mnoframe" individually.  
This set of options will generally provide near-optimal performance on a
single processor system with a minimum of effort.  By experimenting with 
other available options, further significant performance gains can sometimes
be realized.  If you want to see all of the available pgf77 command-line
options, enter the command:

	% man pgf77

or browse the PGI Workstation User's Guide in the online documentation.  Note
that the "man" command is not available in the PGI Workstation for Windows NT
command window, so you'll have to browse the manual on Windows NT.

Finally, re-compile "fftpde.F" adding the "-Minfo" option to the compile
line:

	% pgf77 -fast -Minfo fftpde.F -o fftpde

You should see output something like the following:

	fftpde:
	   157, Loop unrolled 4 times
	randlc:
	   233, Loop unrolled 4 times
	   238, Loop unrolled 4 times
	vranlc:
	   308, Loop unrolled 4 times
	   313, Loop unrolled 4 times
	cfft3:
	   401, Loop unrolled 5 times
	   405, Loop unrolled 5 times
	   417, Loop unrolled 5 times
	   421, Loop unrolled 5 times
	   433, Loop unrolled 5 times
	   437, Loop unrolled 5 times
	fft:
	   481, Invariant if transformation
	           copies loop at: 485
	   491, Loop unrolled 2 times
	        Loop unrolled 2 times
	   523, Loop unrolled 6 times
	tabfft:
	   535, Loop unrolled 5 times
	Linking:

The "-Minfo" switch provides what is essentially a compile-time optimization
listing.  As compilation proceeds, information is printed to the screen
describing the types of optimizations and transformations performed, by
program unit and line number.

*******
STEP 3:  Profile the serial version of fftpde.F using PGPROF.
*******

The first step in the optimization/parallelization process is to determine
where a program spends most of its execution time.  The PGPROF graphical
profiler is designed to convey this information in a simple and intuitive
way.  Re-compile and re-execute the program using the following commands:

	% pgf77 -fast -Minfo -Mprof=func fftpde.F -o fftpde
	% fftpde
	% diff fft.out results

You should see very little difference in the execution time from your
previous run.  The option -Mprof=func causes PGF77 to instrument "fftpde"
for function level profiling.  This allows you to see where execution time 
is spent on a function-by-function basis.  When the program completes
execution, a trace file called "pgprof.out" is written to the working
directory.  To view the profile, enter the command:

	% pgprof 

and the PGPROF graphical profiler should appear on your screen.  Notice
that most of the time is spent in 4 program units:  fft, cfft3, fftpde,
and vranlc.  However, we need a way to determine exactly which source
lines in those program units are consuming the most time.  Quit PGPROF
with a left mouse click on File -> Quit. 

In addition to function-level profiling, PGPROF supports line-level 
profiling.  Re-compile, re-execute, and profile with the following commands:

	% pgf77 -fast -Minfo -Mprof=lines fftpde.F -o fftpde
	% fftpde
	% diff fft.out results
	% pgprof

Note that with line profiling, the execution time of the program increases
significantly.  This is sometimes described by saying that line profiling
is "intrusive", meaning it affects execution time noticeably.  Even though
line-level profiling is intrusive, it's generally the case that the added
overhead is spread uniformly throughout the program.  Line profiling usually
provides valuable information about which source lines are consuming the most 
execution time.

Here's where you need to start thinking.  When PGPROF comes up, note that
"Time" for the "fft" and "cfft3" program units is higher than for "fftpde".  
However, the "Cost" of "fftpde" is higher than that of any other program 
unit.  "Time" represents the execution time spent within a particular
program unit itself.  "Cost" represents the execution time spent within a
particular program unit *and* all the subroutines or functions it calls.
Your goal in parallelizing a program is to do so at as high a level 
as possible.  That means you want to work "top down", i.e. start as high 
in the call chain as you can.  To do that you want to start with program
units that are most expensive in terms of "Cost" rather than "Time".
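To make the distinction concrete, here is a small hypothetical illustration
(written in C purely for brevity; the function names are invented and do
not appear in fftpde.F):

```c
#include <assert.h>

/* Hypothetical sketch of "Time" vs "Cost".  Nearly all work happens in
   inner(); outer() does almost nothing itself.  A profiler would report
   a small exclusive "Time" for outer() but a large inclusive "Cost",
   because "Cost" also charges outer() for the callees it invokes. */
static long inner(long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)    /* the hot loop: dominates "Time" */
        sum += i;
    return sum;
}

static long outer(long n) {
    return inner(n);    /* little "Time" of its own, but high "Cost" */
}
```

Parallelizing at the level of outer() rather than inner() is the "top down"
strategy: one parallel region at the top of the call chain captures all of
the Cost beneath it.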

Let's look at "fftpde".  Single-left-click in the PGPROF window on the 
line showing "Time" and "Cost" for "fftpde".  That line should now be 
highlighted (or "selected").  Now double-left-click the "fftpde" line.  A 
window containing the source code for "fftpde" should appear on your screen.
You may want to widen the "fftpde" source window.  You can do so by placing
the cursor over the right border of the PGPROF window -- so that the
cursor converts to a horizontal double-arrow line -- then left-clicking
and dragging the window to the desired width.

Now scroll down through the source.  Note that the call to CFFT3 at line 151
consumes most of the time.  To see what percentage of execution time is spent 
in CFFT3 (in addition to seconds), select "View -> Cost -> Percent of Cost".
You should see that 75% or more of the execution time is spent within the
call to CFFT3.  The loop at line 135, which contains the call to CFFT3,
is not parallelizable (why?).  Let's look inside CFFT3 and see if we can
parallelize it.  Exit the line-level window for "fftpde" by selecting
"File -> Close".

View the line profile for CFFT3 (see above on viewing the line info for
"fftpde" if you don't remember the sequence of commands).  Again, view the
"Cost" of each line as a percentage.  Scroll down and note that the
triply-nested loops at lines 399, 415, and 431 consume most of the execution
time.  Each of these loops is parallelizable!  Why?  Each is computing
sequences of 1D FFTs which are completely independent.  The loop at line
399 performs FFTs on the columns (i.e. in the 1st dimension), the loop at 
line 415 performs FFTs on the rows (2nd dimension), and the loop at
line 431 performs FFTs on the rays (3rd dimension) of the 3D complex
array X1, represented by "X1real" and "X1imag".  Quit PGPROF by left-clicking
on "File -> Quit".

Before we start editing fftpde.F, one point deserves emphasis.  PGPROF does
not determine which loops are parallelizable, only which loops consume the
most time.  You must determine, by inspection or by prior knowledge of the
algorithm, which loops are parallelizable.  That is a topic we do not cover
in this tutorial.

*******
STEP 4:  Parallelizing CFFT3
*******

Now edit fftpde.F using the editor of your choice.  We assume here that
you are using "vi".  Enter the editor by issuing the following command:

	% vi fftpde.F

Go to line 399.  The loop looks as follows:

          do k=1,n3
                do j=1,n2
                        do i=1,n1
                                z(i)=cmplx(x1real(i,j,k),x1imag(i,j,k))
                        end do
                        call fft(z,inverse,w,n1,m1)
                        do i=1,n1
                                x1real(i,j,k)=real(z(i))
                                x1imag(i,j,k)=aimag(z(i))
                        end do
                end do
          end do

If you are not at all familiar with OpenMP, stop and read the introduction
to Chapter 10 of the PGI Workstation User's Guide.  It describes in detail 
how to form OpenMP parallelization directives and how each directive 
functions.  In particular, you want to parallelize the "do k=1,n3" loop.  Can 
you figure out what directives to enter to make this happen?  Note that the 
array "z" is a work array that will need to be used by each processor 
separately.  How do you specify that "z" is local to each processor?  Try 
to determine what directives to insert.  If you need a hint, take a look at 
the file STEPS/fftpde_1.F, which contains a version of fftpde.F after the
necessary modifications.
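As a hedged sketch of the general pattern (written in C so the directive
syntax is visible at a glance; the names "scale_rows", "work", and "out"
are invented and do not come from fftpde.F), a loop with a per-iteration
work array is parallelized like this:

```c
#include <assert.h>

#define N 64

/* Sketch of the CFFT3 pattern: parallelize the outer loop and make the
   scratch array private so each thread gets its own copy.  The outer
   loop variable j is automatically private; everything not listed in
   the private clause (here, out) is shared. */
static void scale_rows(double out[N][N]) {
    int j, i;
    double work[N];                  /* scratch reused every iteration */
#pragma omp parallel for private(i, work)
    for (j = 0; j < N; j++) {        /* iterations are independent */
        for (i = 0; i < N; i++)
            work[i] = 2.0 * (i + j); /* stage into the private scratch */
        for (i = 0; i < N; i++)
            out[j][i] = work[i];     /* each thread owns its own rows j */
    }
}
```

In Fortran the corresponding construct is a !$OMP PARALLEL DO directive
with a PRIVATE clause; the work array "z" in CFFT3 plays the role of
"work" here.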

Once you have parallelized the loop at line 399, do the same for the loops
around line 415 and around line 430 (the line numbers are changing as you
insert new source lines).  Then re-compile the program with the "-mp"
switch, which enables OpenMP directive processing:

	% pgf77 -fast -Minfo -mp fftpde.F -o fftpde

NOTE:  In releases 1.7-6 and prior, there is no -Minfo message to indicate
that OpenMP loops are parallelized.  This feature was added in the 3.0
release.  However, the loops will be parallelized as expected if the OpenMP
directives are properly formed.

Then re-execute the program as follows:

	% fftpde
	% diff fft.out results

You should see timings very similar to your original results above.  Now
to run in parallel execute the following commands if you are in a "csh" 
shell:

	% setenv OMP_NUM_THREADS 2
	% fftpde

Or the following if you are in a Bourne shell, Korn shell, or PGI Workstation
for Windows NT command shell:

	% OMP_NUM_THREADS=2
	% export OMP_NUM_THREADS
	% fftpde


NOTE:  If your program core dumps at this stage, it is probably because your
NOTE:  default stack size is not large enough.  From within a C shell, type
NOTE:  the command:
NOTE:
NOTE:		% limit
NOTE:
NOTE:  to see what the stack size limit is.  On Solaris, issue the command
NOTE:
NOTE:		% limit stacksize unlimited
NOTE:
NOTE:  and it should take care of the problem.  On Linux, you may have to
NOTE:  issue the following commands in order to work around the problem:
NOTE:
NOTE:		% su root
NOTE:		% csh
NOTE:		% unlimit stacksize
NOTE:		% su <your_username>
NOTE:
NOTE:  Until versions of Linux are available which allow non-privileged users 
NOTE:  to set the stacksize, you will have to use this workaround in order 
NOTE:  to run parallel programs with large amounts of data.
NOTE:
NOTE:  On Windows NT, your stacksize should grow automatically as needed.

Here you may see differing timings based on which operating system you are
working under.  Most operating systems (including Linux and Solaris) return 
aggregate CPU time from the "etime" timer call.  That means you must divide
the reported time by the number of processors to get the approximate elapsed
time.  For example, if "fftpde" reports 60 CPU seconds on 2 processors, the
elapsed time is roughly 30 seconds.  Once you've accounted for this factor,
you should see an execution time that is considerably faster (35% to 50%)
than the single-processor time.

Now recompile with -Mprof=func and compare the 1 and 2 processor profiles
using the following commands:

	% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
	% fftpde
	% mv pgprof.out pgprof.out.2
	% setenv OMP_NUM_THREADS 1
	% fftpde
	% mv pgprof.out pgprof.out.1
	% pgprof pgprof.out.1 &
	% pgprof pgprof.out.2 &

Or the following if you are in a Bourne shell, Korn shell, or PGI Workstation
for Windows NT command shell:

	% pgf77 -fast -Minfo -mp -Mprof=func fftpde.F -o fftpde
	% fftpde
	% mv pgprof.out pgprof.out.2
	% OMP_NUM_THREADS=1
	% export OMP_NUM_THREADS
	% fftpde
	% mv pgprof.out pgprof.out.1
	% pgprof pgprof.out.1 &
	% pgprof pgprof.out.2 &

You should see near-linear speedup in the time spent in "fft".  The time
spent in "cfft3" may or may not show linear speed-up.  Most of the time in
"cfft3" itself is spent moving data back and forth between the X1 arrays
and the work vector.  Systems that are bandwidth-limited won't be able to
sustain both processors at full speed during this operation.

*******
STEP 5:  Parallelizing the remainder of fftpde.F
*******

Using steps similar to those outlined in STEP 4, progressively parallelize
the following loops:

	1) The DO 130 loop near line 64 (STEPS/fftpde_2.F)
	2) The DO 190 loop near line 115 (STEPS/fftpde_3.F)
	3) The DO 220 loop near line 139 (STEPS/fftpde_4.F)
	4) The DO 250 loop near line 155 (STEPS/fftpde_5.F)

Be careful in each case to declare variables PRIVATE where necessary.  A good
rule of thumb is that any variable used as a work variable, i.e. one which
is re-initialized with a new value upon each iteration of the loop, is
probably a variable you want to declare PRIVATE.
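The rule of thumb can be sketched as follows (a C illustration with
invented names; the same reasoning applies to the Fortran loops in
fftpde.F):

```c
#include <assert.h>

/* "tmp" is re-initialized on every iteration, so it must be private:
   if it were shared, one thread could overwrite it between another
   thread's write and read.  "in" and "out" are indexed only by the
   loop variable, so they can safely remain shared. */
static void square_all(int n, const int *in, int *out) {
    int k, tmp;
#pragma omp parallel for private(tmp)
    for (k = 0; k < n; k++) {
        tmp = in[k];           /* work variable: fresh value each pass */
        out[k] = tmp * tmp;    /* safe only because tmp is private */
    }
}
```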

When you have completed parallelizing all of fftpde.F, you should see
execution times on 2 processors that are between 60% and 90% faster than
the single-processor version.  Speedup is dependent on both the hardware
you are using and the size of the problem you are running.  To run the
actual Class A NAS FT benchmark, you need to set M1 = 8, M2 = 8, and
M3 = 7 in the PARAMETER statements near the top of the file fftpde.F.  On
a system with sufficient memory bandwidth, you should see speedups close
to 2X on the Class A problem.  Note that you will need a system with at
least 512 MB of main memory in order to run the Class A problem.

*******
STEP 6:  Optional: Factoring Out Parallel Regions 
*******

OpenMP allows you to call subroutines from within parallel regions and
place "orphan" OpenMP directives within those subroutines.  For example,
you can have sequences of code such as the following:

	PROGRAM TEST
         .
         .
!$OMP PARALLEL
         .
         .
        CALL TEST_SUBR (...)
         .
         .
!$OMP END PARALLEL
         .
         .
        END
C
        SUBROUTINE TEST_SUBR (...)
         .
         .
!$OMP DO
        DO I = 1, N
         .
         .
        ENDDO
!$OMP END DO
         .
         .
        RETURN
        END
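The same pattern can be sketched in C (illustrative names only): the
worksharing directive is "orphaned" inside the callee, while the parallel
region is opened in the caller:

```c
#include <assert.h>

/* The "#pragma omp for" below is orphaned: it appears in a routine that
   is called from inside a parallel region, not lexically within one.
   Each thread entering fill_squares() takes a share of the iterations,
   with an implicit barrier at the end of the loop. */
static void fill_squares(int n, int *out) {
#pragma omp for
    for (int i = 0; i < n; i++)
        out[i] = i * i;
}

static void driver(int n, int *out) {
#pragma omp parallel       /* region opened here; loop split in callee */
    fill_squares(n, out);
}
```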

Can you parallelize fftpde.F with fewer parallel regions than were used in
the steps above?  How few parallel regions can be used?  Does using fewer
parallel regions significantly affect the performance?  If you try this
optional step, beware of the need for synchronization barriers between
parallel loops and the need to define private variables properly.
