-------------------------------------------------------------------------------
Name of Program : POLMP
(Proudman Oceanographic Laboratory Multiprocessing Program)
-------------------------------------------------------------------------------
Submitter's Name : Mike Ashworth
Submitter's Organization: NERC Computer Services
Submitter's Address : Bidston Observatory
Birkenhead, L43 7RA, UK
Submitter's Telephone # : +44-51-653-8633
Submitter's Fax # : +44-51-653-6269
Submitter's Email : M.Ashworth@ncs.nerc.ac.uk
-------------------------------------------------------------------------------
Major Application Field : Fluid Dynamics
Application Subfield(s) : Ocean and Shallow Sea Modeling
-------------------------------------------------------------------------------
Application "pedigree" (origin, history, authors, major mods) :
The POLMP project was created to develop numerical
algorithms for shallow sea 3D hydrodynamic models that run
efficiently on modern parallel computers. A code was
developed, using a set of portable programming conventions
based upon standard Fortran 77, which follows the wind
induced flow in a closed rectangular basin including a number
of arbitrary land areas. The model solves a set of
hydrodynamic partial differential equations, subject to a set of
initial conditions, using a mixed explicit/implicit forward time
integration scheme. The explicit component corresponds to a
horizontal finite difference scheme and the implicit to a
functional expansion in the vertical (Davies, Grzonka and
Stephens, 1989).
By the end of 1989 the code had been implemented on the RAL
4 processor Cray X-MP using Cray's microtasking system,
which provides parallel processing at the level of the Fortran
DO loop. Acceptable parallel performance was achieved by
integrating each of the vertical modes in parallel, referred to
in Ashworth and Davies (1992) as vertical partitioning. In
particular, a speed-up of 3.15 over single processor execution
was obtained, with an execution rate of 548 Megaflops
corresponding to 58 per cent of the peak theoretical
performance of the machine. Execution on an 8 processor Cray
Y-MP gave a speed-up efficiency of 7.9 and 1768 Megaflops or
67 per cent of the peak (Davies, Proctor and O'Neill, 1991).
The latter result led to the project being awarded a prize in
the 1990 Cray Gigaflop Performance Awards.
The project has been extended by implementing the shallow
sea model in a form which is more appropriate to a variety of
parallel architectures, especially distributed memory
machines, and to a larger number of processors. It is especially
desirable to be able to compare shared memory parallel
architectures with distributed memory architectures. Such a
comparison is currently relevant to NERC science generally
and will be a factor in the considerations for the purchase of
new machines, bids for allocations on other academic
machines, and for the design of new codes and the
restructuring of existing codes.
In order to simplify development of the new code and to ensure
a proper comparison between machines, a restructured version
of the original code was designed which performs
partitioning of the region in the horizontal dimension.
This has the advantage over vertical partitioning that the
communication between processors is limited to a few points
at the boundaries of each sub-domain. The ratio of interior
points to boundary points, which determines the ratio of
computation to communication and hence the efficiency on
message passing, distributed memory machines, may be
increased by increasing the size of the individual sub-domains.
This design may also improve the efficiency on shared memory
machines by reducing the time of the critical section and
reducing memory conflicts between processors. In addition, the
required number of vertical modes is only about 16, which,
though well suited to a 4 or 8 processor machine, does not
provide sufficient parallelism for more highly parallel
machines.
The code has been designed with portability in mind, so that
essentially the same code may be run on parallel computers
with a range of architectures.
-------------------------------------------------------------------------------
May this code be freely distributed (if not specify restrictions) :
Yes, but users are requested to acknowledge the author (Mike Ashworth)
in any resulting research or publications, and are encouraged to send
reprints of their work with this code to the authors. Also, the authors
would appreciate being notified of any modifications to the code.
-------------------------------------------------------------------------------
Give length in bytes of integers and floating-point numbers that should be
used in this application:
Eight-byte floating point numbers are used in parts of the initialization
code, but calculations on the main field arrays may be done using
4 byte floating point variables without grossly affecting the solution.
Conversion between the two precisions is facilitated by a switch supplied
to the C preprocessor. By specifying -DSINGLE, variables will be declared
as REAL, normally 4 bytes, whereas -DDOUBLE will cause declarations to be
DOUBLE PRECISION, normally 8 bytes.
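As an illustration, a minimal sketch of how such a switch might be applied
is given below; the macro name FLOATVAR and the declared arrays are
hypothetical and are not taken from the POLMP source.

#if defined(DOUBLE)
#define FLOATVAR DOUBLE PRECISION
#else
#define FLOATVAR REAL
#endif
C     Field arrays are declared with the selected working precision
      FLOATVAR Z(NMAX,MMAX), U(NMAX,MMAX), V(NMAX,MMAX)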
-------------------------------------------------------------------------------
Documentation describing the implementation of the application (at module
level, or lower) :
The documentation supplied with the code describes how the various versions
of the code should be built. Extensive documentation, including the
definition of all variables in COMMON is present as comments in the code.
-------------------------------------------------------------------------------
Research papers describing sequential code and/or algorithms :
1) Davies, A.M., Formulation of a linear three-dimensional hydrodynamic
sea model using a Galerkin-eigenfunction method, Int. J. Num. Meth.
in Fluids, 1983, Vol. 3, 33-60.
2) Davies, A.M., Solution of the 3D linear hydrodynamic equations using
an enhanced eigenfunction approach, Int. J. Num. Meth. in Fluids,
1991, Vol. 13, 235-250.
-------------------------------------------------------------------------------
Research papers describing parallel code and/or algorithms :
1) Ashworth, M. and Davies, A.M., Restructuring three-dimensional
hydrodynamic models for computers with low and high degrees of
parallelism, in Parallel Computing '91, eds D.J.Evans, G.R.Joubert
and H.Liddell (North Holland, 1992), 553-560.
2) Ashworth, M., Parallel Processing in Environmental Modelling, in
Proceedings of the Fifth ECMWF Workshop on Use of Parallel Processors in
Meteorology (Nov. 23-27, 1992)
Hoffman, G.-R and T. Kauranne, ed.,
World Scientific Publishing Co. Pte. Ltd, Singapore, 1993.
3) Ashworth, M. and Davies, A.M., Performance of a Three Dimensional
Hydrodynamic Model on a Range of Parallel Computers, in
Proceedings of the Euromicro Workshop on Parallel and Distributed
Computing, Gran Canaria 27-29 January 1993, pp 383-390, (IEEE
Computer Society Press)
4) Davies, A.M., Ashworth, M., Lawrence, J., O'Neill, M.,
Implementation of three dimensional shallow sea models on vector
and parallel computers, 1992a, CFD News, Vol. 3, No. 1, 18-30.
5) Davies, A.M., Grzonka, R.G. and Stephens, C.V., The implementation
of hydrodynamic numerical sea models on the Cray X-MP, 1992b, in
Advances in Parallel Computing, Vol. 2, edited D.J. Evans.
6) Davies, A.M., Proctor, R. and O'Neill, M., "Shallow Sea
Hydrodynamic Models in Environmental Science", Cray Channels,
Winter 1991.
-------------------------------------------------------------------------------
Other relevant research papers:
-------------------------------------------------------------------------------
Application available in the following languages (give message passing system
used, if applicable, and machines application runs on) :
Fortran 77 version
A sequential version of POLMP is available, which conforms
to the Fortran 77 standard. This version has been run on a
large number of machines from workstations to supercomputers
and any code which caused problems, even if it conformed to
the standard, has been changed or removed. Its conformance to
the Fortran 77 standard, and its portability in practice, are
therefore well established.
In order to allow the code to run on a wide range of problem
sizes without recompilation, the major arrays are defined
dynamically by setting up pointers, with names starting with
IX, which point to locations in a single large data array: SA.
Most pointers are allocated in subroutine MODSUB and the
starting location passed down into subroutines in which they
are declared as arrays. For example :
IX1 = 1
IX2 = IX1 + N*M
CALL SUB ( SA(IX1), SA(IX2), N, M )
SUBROUTINE SUB ( A1, A2, N, M )
DIMENSION A1(N,M), A2(N,M)
END
Although this is probably against the spirit of the Fortran 77
standard, it is considered the best compromise between
portability and utility, and has caused no problems on any of
the machines on which it has been tried.
The code has been run on a number of traditional vector
supercomputers, mainframes and workstations. In addition,
key loops can be parallelized automatically by some
compilers on shared (or virtual shared) memory MIMD machines,
allowing parallel execution on the Convex C2 and C3, Cray X-MP,
Y-MP, and Y-MP/C90, and Kendall Square Research KSR-1. Cray
macrotasking calls may also be enabled for an alternative
mode of parallel execution on Cray multiprocessors.
PVM and Parmacs versions
POLMP has been implemented on a number of message-passing machines:
Intel iPSC/2 and iPSC/860, Meiko CS-1 (i860) and CS-2, and nCUBE 2.
Code is also present for the PVM and Parmacs portable message
passing systems, and POLMP has run successfully, though not
efficiently, on a network of Silicon Graphics workstations.
Calls to message passing routines are concentrated
in a small number of routines for ease of portability and
maintenance. POLMP performs housekeeping tasks on one node of the
parallel machine, usually node zero, referred to in the code as the
driver process, the remaining processes being workers.
PVM and Parmacs are software environments for implementing the
message-passing parallel programming model. Portability is
achieved in PVM using library calls and in Parmacs using macro
definitions which are then translated into machine-dependent code
or calls to machine-dependent library routines. They have both been
implemented on a wide range of parallel architectures, and have thus
become de facto standards for message-passing codes. However, they
are both likely to be superseded by the Message Passing Interface
when it comes into widespread use.
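As an indication of the style of the message-passing layer, the following
is a minimal sketch, using the PVM 3 Fortran interface, of how a driver
might multicast the steering parameters to the workers; the routine and
variable names (BCPARM, RCPARM, ITIDS, NWORK, NX, NY, M, NTS) are
hypothetical and do not correspond to the actual POLMP routines.

C     Driver side: pack the parameters and multicast them
      SUBROUTINE BCPARM ( ITIDS, NWORK, NX, NY, M, NTS )
      INCLUDE 'fpvm3.h'
      INTEGER NWORK, ITIDS(NWORK), NX, NY, M, NTS, IBUF, INFO
      CALL PVMFINITSEND ( PVMDEFAULT, IBUF )
      CALL PVMFPACK ( INTEGER4, NX,  1, 1, INFO )
      CALL PVMFPACK ( INTEGER4, NY,  1, 1, INFO )
      CALL PVMFPACK ( INTEGER4, M,   1, 1, INFO )
      CALL PVMFPACK ( INTEGER4, NTS, 1, 1, INFO )
C     Multicast to all NWORK worker tasks with message tag 1
      CALL PVMFMCAST ( NWORK, ITIDS, 1, INFO )
      END

C     Worker side: receive from any sender and unpack in the same order
      SUBROUTINE RCPARM ( NX, NY, M, NTS )
      INCLUDE 'fpvm3.h'
      INTEGER NX, NY, M, NTS, IBUF, INFO
      CALL PVMFRECV ( -1, 1, IBUF )
      CALL PVMFUNPACK ( INTEGER4, NX,  1, 1, INFO )
      CALL PVMFUNPACK ( INTEGER4, NY,  1, 1, INFO )
      CALL PVMFUNPACK ( INTEGER4, M,   1, 1, INFO )
      CALL PVMFUNPACK ( INTEGER4, NTS, 1, 1, INFO )
      END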
Parmacs version 5 requires the use of a host process, which had
not been used previously in the POLMP code. A simple host program
has therefore been provided which loads the node program onto a
two-dimensional torus and then takes no further part in the run,
other than to receive a completion code from the driver, in case
terminating the host early would interfere with execution of the
nodes.
Subset-HPF version
A data parallel version of the code has been run on the
Thinking Machines CM-2, CM-200 and MasPar MP-1 machines.
High Performance Fortran (HPF) defines extensions to the
Fortran 90 language in order to provide support for parallel
execution on a wide variety of machines using a data parallel
programming model.
The subset-HPF version of the POLMP code has been written
to the draft standard specified by the High Performance
Fortran Forum in the HPF Language Specification version 0.4
dated November 6, 1992. Fortran 90 code was developed on a
Thinking Machines CM-200 machine and checked for
conformance with the Fortran 90 standard using the
NAGWare Fortran 90 compiler. HPF directives were inserted
by translating from the CM Fortran directives, but have not
been tested due to the lack of access to an HPF compiler. The
only HPF features used are the PROCESSORS, TEMPLATE,
ALIGN and DISTRIBUTE directives and the system inquiry
intrinsic function NUMBER_OF_PROCESSORS.
In order to allow the code to run on a wide range of problem
sizes without recompilation, the major arrays are defined in
subroutine MODSUB using automatic array declarations with
the dimensions of the arrays being passed as subroutine
arguments.
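As an illustration of how the features mentioned above fit together, the
following is a minimal sketch, not taken from the POLMP source, combining
automatic arrays with the subset-HPF directives; the interface of MODSUB
shown here and the array names Z, U and V are hypothetical.

      SUBROUTINE MODSUB ( NX, NY, M )
      INTEGER NX, NY, M
!HPF$ PROCESSORS PROCS ( NUMBER_OF_PROCESSORS() )
!     Automatic arrays: dimensions come from the subroutine arguments,
!     so the problem size may change without recompilation
      REAL Z(NX,NY), U(NX,NY,M), V(NX,NY,M)
!HPF$ TEMPLATE T ( NX, NY )
!HPF$ DISTRIBUTE T ( *, BLOCK ) ONTO PROCS
!HPF$ ALIGN Z(I,J) WITH T(I,J)
!HPF$ ALIGN U(I,J,*) WITH T(I,J)
!HPF$ ALIGN V(I,J,*) WITH T(I,J)
!     ... time stepping operates on the distributed arrays ...
      END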
-------------------------------------------------------------------------------
Total number of lines in source code: 26,000 (approx.)
Number of lines excluding comments : 13,000
Size in bytes of source code : 700,000
-------------------------------------------------------------------------------
List input files (filename, number of lines, size in bytes, and if formatted) :
steering file: 48 lines, 2170 bytes, ascii (typical size)
-------------------------------------------------------------------------------
List output files (filename, number of lines, size in bytes, and if formatted) :
standard output: 700 lines, 62,000 bytes, ascii (typical size)
-------------------------------------------------------------------------------
Brief, high-level description of what application does:
POLMP solves the linear three-dimensional hydrodynamic equations
for the wind induced flow in a closed rectangular basin of constant depth
which may include an arbitrary number of land areas.
-------------------------------------------------------------------------------
Main algorithms used:
The discretized form of the hydrodynamic equations is solved for the field
variables z (surface elevation) and u and v (the horizontal components of
velocity). The fields are represented in the horizontal by a staggered
finite difference grid. The profile of vertical velocity with depth
is represented by the superposition of a number of spectral components.
The functions used in the vertical are arbitrary, although the
computational advantages of using eigenfunctions (modes) of the eddy
viscosity profile have been demonstrated (Davies, 1983). Velocities
at the closed boundaries are set to zero.
Each timestep in the forward time integration of the model involves
successive updates to the three fields, z, u and v. New field values
computed in each update are used in the subsequent calculations. A
five point finite difference stencil is used, requiring only nearest
neighbours on the grid.
A number of different data storage and data processing methods are
included, mainly for handling cases with significant amounts of land
(e.g. index arrays and packed data). In particular, the program may be
switched between masked operation, more suitable for vector processors,
in which computation is done on all points, but land and boundary points
are masked out, and strip-mining, more suitable for scalar and RISC
processors, in which calculations are only done for sea points.
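The distinction between the two methods may be illustrated schematically
as follows; this is not actual POLMP code, and the names MASK, DZ, NSTRIP,
JSTRIP, ISTART and IEND are hypothetical.

C     Masked operation: update every point and multiply the update by
C     a 0/1 sea mask, so that land and boundary points are unchanged
      DO 20 J = 1, NY
         DO 10 I = 1, NX
            Z(I,J) = Z(I,J) + MASK(I,J) * DZ(I,J)
 10      CONTINUE
 20   CONTINUE
C     Strip-mining: loop only over precomputed contiguous runs
C     (strips) of sea points, skipping land entirely
      DO 40 IS = 1, NSTRIP
         J = JSTRIP(IS)
         DO 30 I = ISTART(IS), IEND(IS)
            Z(I,J) = Z(I,J) + DZ(I,J)
 30      CONTINUE
 40   CONTINUE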
-------------------------------------------------------------------------------
Skeleton sketch of application:
The call chart of the major subroutines is represented thus:
AAAPOL -> APOLMP -> INIT
-> RUNPOL -> INIT2 -> MAP -> GLOBAL -> SETDEP -> INDP
-> IRISH -> INTERP
-> EIRISH
-> GRAPH
-> DIVIDE
-> PRMAP
-> SNDWRK
-> RCVWRK
-> SETUP -> GENSTP
-> SPEC -> ROOTS -> TRANS
-> MODSUB -> MODEL -> ASSIGN -> GENMSK
-> GENSTP
-> GENIND
-> GENPAC
-> METRIC -> STATS
-> SEQ
-> CLRFLD
-> TIME* -> SNDBND
-> PSTBND
-> RCVBND
-> RESULT
-> SNDRES
-> RCVRES
-> MODOUT -> OZUVW -> OUTFLD -> GETRES
-> OUTARR
-> GRYARR
-> WSTATE
AAAPOL is a dummy main program calling APOLMP. APOLMP calls INIT which
reads parameters from the steering file, checks and monitors them.
RUNPOL is then called which calls another initialization routine INIT2.
Called from INIT2, MAP forms a map of the domain to be modelled, GLOBAL,
IRISH and EIRISH are hard-wired maps of three real-world models, GRAPH
generates a connection graph for the network of grid points, DIVIDE
divides the domain between processors, and PRMAP maps sub-domains onto
processes.
SNDWRK on the driver process sends details of the sub-domain to be
worked on to each worker. RCVWRK receives that information. SETUP
does some array allocation, GENSTP generates indexes for strip-mining
and SPEC, ROOTS and TRANS set up the coefficients for the spectral
expansion. MODSUB does the main allocation of array space to the field
and ancillary arrays. MODEL is the main driver subroutine for the model.
ASSIGN calls routines to generate masks, strip-mining indexes, packing
indexes and measurement metrics. CLRFLD initializes the main data arrays.
One of seven time-stepping routines, TIME*, is chosen depending
on the vectorization and packing/indexing method used to cope with the
presence of land and boundary data. SNDBND and RCVBND handle the sending
and reception of boundary data between sub-domains. After the required
number of time-steps has been completed, RESULT saves results from the
desired region, and SNDRES on the workers and RCVRES on the driver collect the
result data. MODOUT handles the writing of model output to standard output
and disk files, as required.
For a non-trivial run, 99% of the time is spent in whichever of the
TIME* timestepping routines has been chosen.
-------------------------------------------------------------------------------
Brief description of I/O behavior:
The driver process, usually processor 0, reads in the input parameters
and broadcasts them to the rest of the processors. The driver also receives
the results from the other processors and writes them out.
-------------------------------------------------------------------------------
Describe the data distribution (if appropriate) :
The processors are treated as a logical 2-D grid. The simulation domain
is divided into a number of sub-domains, which are allocated one
per processor.
-------------------------------------------------------------------------------
Give parameters of the data distribution (if appropriate) :
The number of processors, p, and the number of sub-domains are provided
as steering parameters, as is a switch which requests either one-dimensional
or two-dimensional partitioning.
Partitioning is only actually carried out for the message passing versions
of the code. For two-dimensional partitioning, p is factored as px * py,
where px and py are as close as possible to sqrt(p).
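A minimal sketch of such a factorization (not the actual POLMP routine,
whose name and interface may differ) is:

      SUBROUTINE FACTOR ( P, PX, PY )
      INTEGER P, PX, PY
C     Search downwards from sqrt(p) for the largest divisor of p
      PX = INT( SQRT( REAL( P ) ) )
 10   CONTINUE
      IF ( MOD( P, PX ) .NE. 0 ) THEN
         PX = PX - 1
         GOTO 10
      ENDIF
      PY = P / PX
      END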
For the data parallel version the number of sub-domains is set to one
and decomposition is performed by the compiler via data distribution
directives.
-------------------------------------------------------------------------------
Brief description of load balance behavior :
Unless land areas are specified, the load is fairly well balanced.
If px and py evenly divide the number of grid points in each horizontal
direction, then the model is perfectly balanced, except that boundary
sub-domains have fewer communications.
No tests with land areas have yet been performed with the parallel
code, and more sophisticated domain decomposition algorithms have
not yet been included.
-------------------------------------------------------------------------------
Give parameters that determine the problem size :
nx, ny Size of horizontal grid
m Number of vertical modes
nts Number of timesteps to be performed
-------------------------------------------------------------------------------
Give memory as function of problem size :
The memory requirement scales essentially with nx * ny * m (it is
independent of nts); see below for specific examples.
-------------------------------------------------------------------------------
Give number of floating-point operations as function of problem size :
Assuming standard compiler optimizations, there is a requirement for
29 floating point operations (18 add/subtracts and 11 multiplies) per
grid point, so the total computational load is
29 * nx * ny * m * nts
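For example, for the v200 problem defined below (nx = ny = 512, m = 16,
nts = 200) this gives 29 * 512 * 512 * 16 * 200, or approximately 24.3 Gflop,
consistent with the table of benchmark problem sizes.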
-------------------------------------------------------------------------------
Give communication overhead as function of problem size and data distribution :
During each timestep each sub-domain of size nsubx=nx/px by nsuby=ny/py
requires the following communications in words :
nsubx * m from N
nsubx from S
nsubx * m from S
nsuby * m from W
nsuby from E
nsuby * m from E
m from NE
m from SW
making a total of
(2 * m + 1) * (nsubx + nsuby) + 2*m words
in eight messages from six directions.
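For example, for the v200 problem (nx = ny = 512, m = 16) partitioned
two-dimensionally over, say, 16 processors (px = py = 4, so
nsubx = nsuby = 128), each sub-domain exchanges
(2*16 + 1) * (128 + 128) + 2*16 = 8480 words per timestep.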
-------------------------------------------------------------------------------
Give three problem sizes, small, medium, and large for which the benchmark
should be run (give parameters for problem size, sizes of I/O files,
memory required, and number of floating point operations) :
The data sizes and computational requirements for the three
benchmarking problems supplied are :
  Name      nx  x   ny x  m x nts   Computational load   Memory
                                        (Gflop)          (Mword)
  v200     512 x  512 x 16 x 200           24               14
  wa200   1024 x 1024 x 40 x 200          226              126
  xb200   2048 x 2048 x 80 x 200         1812              984
The memory sizes are the number of Fortran real elements
(words) required for the strip-mined case on a single processor.
For the masked case the memory requirement is approximately doubled
for the extra mask arrays. For the message passing versions, the
total memory requirement will also tend to increase slightly (<10%)
with the number of processors employed.
-------------------------------------------------------------------------------
How did you determine the number of floating-point operations (hardware
monitor, count by hand, etc.) :
Count by hand looking at inner loops and making reasonable assumptions
about common compiler optimizations.
-------------------------------------------------------------------------------
Other relevant information:
-------------------------------------------------------------------------------