If you are planning to do some numerical calculations, the following suggestions may help you to do your work :
Fortran as a Higher-level Language
Some Sample Problems Solved with Public Domain Software
MATLAB : The Rome that all roads lead to
Using Workstation Clusters to squeeze extra cycles
By following the suggestions below, you should be able to greatly reduce the amount of coding you need to do, while ending up with more efficient and robust programs. By 'properly' breaking your problem into its logical constituent parts, e.g. :
you may be able to write your program by largely calling a series of high quality public domain routines and packages. Thus, by accessing the huge, and growing volume of high quality Fortran software in the public domain, you are essentially using Fortran as a higher level language.
LAPACK and BLAS routines as much as
possible. Highly optimized versions of the BLAS are available
for many computers. This will help you to get the best performance out of
a given system, and it will also be portable. Replace LINPACK
and EISPACK routines by the corresponding LAPACK
routines, particularly if you are dealing with large matrices.BLAS routine, can be done at twice
the speed of matrix-vector multiplication. DO loop :
DO j=1,n
DO i=1,m
c(i,j) = c(i,j)*beta
ENDDO
ENDDO
may be replaced by
c(1:m,1:n) = c(1:m,1:n) * beta
This is similar to an expression in MATLAB .
For more information about Fortran 90 click here
DO loop unrolling whenever possible. A very simple example is given below :
DO 50 j = 1,m,4
DO 40 k = 1,n
DO 30 i = 1,l
c(i,k) = c(i,k) + a(i,j)*b(j,k) +
& a(i,j+1)*b(j+1,k) + a(i,j+2)*b(j+2,k) +
& a(i,j+3)*b(j+3,k)
30 CONTINUE
40 CONTINUE
50 CONTINUE
LAOPS may be run on your system to give you an indication of the impact of cache memory on several simple algorithms. use :AL = SQRT (ALSQ)
DO 20 I = 1, NX(I) = X(I) / AL
20 CONTINUE
AL = 1.0/SQRT (ALSQ)
DO 20 I = 1, N
X(I) = X(I) * AL
20 CONTINUE
DO loop unrolling
and cache memory is The IBM RISC System/6000 and Linear Algebra Operations, As a general rule, it is a very good idea to make extensive use of LAPACK/BLAS
when writing your program. Also, many general purpose codes now available in the public domain make extensive use of these routines.
In the latest release of LAPACK, the new code, DSYEVD,
uses the 'divide-and-conquer' algorithm to calculate all the eigenvalues/vectors
of real symmetric matrix. DSYEVD makes much greater use of the
BLAS routines than the old DSYEV. In fact, by using
prof, we find that 66% of the cpu time for DSYEVD
was spent in the matrix-matrix multiplication routine DGEMM. DGEMM
is available in highly optimized form on many computers such as the IBM
RS/6000. On a model 375 system the cpu times are 231 and 77 sec for
DSYEV and DSYEVD respectively. The improvement is
significantly better than for the Sparc. Similar remarks hold true for programs
using Fast Fourier Transforms, especially in 2 or 3 dimensions.
These ideas and routines will also be a good starting point if you want to parallelize your code. Some results for multiprocessor systems are available.
Here is a short list of some LAPACK / BLAS based Software
They should be a good core for large scale
calculations involving large vectors and matrices. You should use the
-p option for f77 to profile your program to see if most of the
cpu time is indeed spent in the BLAS routines.
NSUVP : A general purpose FEM code (FELIB) is used to solve the Navier-Stokes Equations using a velocity/pressure formulation on mixed elements.
WEBILU : A general purpose code for differential-algebraic equations solves a system that arises from a food web population model, with predator-prey interaction and diffusion on the unit square in two dimensions.
RADAR5 : A general purpose code for delay differential equations
is used to solve the biological problems :
- acute hepatitis B virus infection problem.
- 4D enzyme kinetics with an inhibitor molecule
- threshold model for antibody production
DK64 : A general purpose DDE solver (DKLAG6) is used for the Solution of the MacKey-Glass Equation for Chaotic Attractors of an Infinite-Dimensional Dynamical System,
VORTEX : A program to simulate laminar flow about a cylinder which may be circular or of limited general shape (without sharp corners)
RNRTST : A test program for the multivariate integration routine RNRTMX which is used to compute expected values for each of the variables for several statistics problems, including :
- Birthweight Data - Bio Oxygen Data - Contingency Table Data - Stanford Heart Transplant Data - Econometric Data - Lubricant Data - Multivariate Logistic Distribution - Multivariate Normal Distribution - Photocarcinogen Data - Radiotherapy Data - Tornado DataClick here for other multivariate integration programs and reference material.
CSE2DD : A program to solve the time-independent cubic Schroedinger eqn :
LAP(U) = -EIG*(U-U^3) : LAP is the 2D Laplacian operator
in the unit square, with U = 0 on the boundary.
Nontrivial solutions bifurcate from the trivial solutions only at the
eigenvalues of the linearized problem :
EIG = EIG0 = (m^2+n^2)*Pi^2 ; m,n > 0, integer
The eigenfunctions of the linearized problem are :
U(x,y) = U0(x,y) = sin(m*Pi*x)*sin(n*Pi*y)
VARAH : A spline least squares method for numerical parameter estimation in differential equations is used to fit the reaction coefficients in the thermal isomerization of alpha-pinene The 5 differentail equations are describing this process are:
y'(1) = - (p(1)+p(2))*y(1)
y'(2) = p(1)*y(1)
y'(3) = p(2)*y(1) - (p(3)+p(4))*y(3) + p(5)*y(5)
y'(4) = p(3)*y(3)
y'(5) = p(4)*y(3) - p(5)*y(5)
where p(*) are the 5 parameters which are to be determined
BURGER : A collection of codes for solving the Burger equation, which has the form :
U + U*U - vis*U = 0
t x xx
where vis is the viscosity.
CONDIT This program solves the system of linear equations from the discretized convection-diffusion problem :
U + U + cU + dU = f
xx yy x
on [0,1]x[0,1] with dirichlet boundary conditions. Discretization is by
centered differences.
As n = nx**2 increases, the problem becomes increasingly difficult to solve
without preconditioning. As c and d increase, the preconditioner becomes
increasingly less effective.
This problem was solved using a bicgstab iterative linear solver
for the case c = 10, d = 100, nx = 999, n = 998001 on a HyperSparc. When no
preconditioner was used, the cpu time required was 10672 sec. When a
Fast Poisson solver was used as a preconditioner, the cpu time required
was 1951 sec. Besides this dramatic reduction in cpu time, there is the
possibility of a further reduction since about 60% of the cpu time for
the preconditioned job was taken by the Fast Poisson solver which uses
2D FFTs. Optimized/parallelized versions of FFTs are available for many
computers.
CAVPIT
This program solves the steady-state Navier-Stokes eqn in 2D for the flow
in a driven cavity. The function solved for is the streamfunction. The
vorticity may be obtained by differentiating the streamfunction. The solver
used is PITCON. It will solve for a series of values of the
Reynolds number.
ASTMHD This program solves the Sturm-Liouville eigenvalue equation :
(p(x)*y'(x))' + (q(x) + eig*r(x))*y(x) = 0
on (a,b) for an Astrophysics MagnetoHydrodynamics problem with :
p = x
q = -x* (k^2+1/x^2)
r = 1/x^3
(a,b) = (1,20)
CSE This program solves the time dependent cubic Schroedinger eqn. :
i*U + U + q*(abs(U)^2)*U = 0
t xx
The solution to this problem is the soliton :
U(x,t) = sqrt(2)*exp(i*(0.5*x + 0.75*t))*sech(x-t)
DISEASE This program solves the coupled nonlinear Volterra integral equations of the second kind
t
/
y(t) = g(t) + | k(t,s,y(s)) ds, t in [t0,te].
/
t0
The present problem is taken from :
Hethcote H. W. and Tudor D. W.;
Integral Equation Models for Endemic Infectious Diseases,
J. Math. Bio., vol. 9, page 37, 1980
[t0, te] = [0, 50]
g(t) = ( a^21/ 100 )
(1+(10-a^20).a) / 100 )
k(t,s,y) = ( b^21 0 ) ( 3.y(1).(1-y(1)-y(2)) )
( (1-b^20).b b/1000 ) ( 1-y(1)-y(2) )
where a = exp(-t/20) and b = exp((s-t)/20)
SCTPWR As part of a Medical Physics project to solve the Boltzmann
equation for high energy electrons, the program SCTPWR was
written to evaluate the Hankel transform :
pi
/
I(q) = | f(x) J0(qx) dx
/
0
The function f(x) varies very rapidly, going from 0 -> 10^6 as x goes
from 0 -> Pi/10000. After that, f(x) drops off to
1/10^6 at x = Pi. This sharp behavior in f(x) leads
to I(q) being very spread out -
values as high as q = 20000 needed to be calculated before
I(q) had decreased sufficiently. The public domain code
HANKEL was satisfactorily able to perform these calculations
because it uses a sophisticated adaptive strategy.
MOVEBD This program solves the moving boundary (Stefan) problem. The problem has melting due to heat input at the fixed boundary . The P.D.E. is defined by the equations
U = U 0 < y < s(t), 0.1 < t < 1
t yy
U = -exp(t) , y = 0
y
U = 0, s(t) = -U
y
on the moving boundary y = s(t),
and the initial solution values at t = 0.1 are given by the
analytic
solution
U = exp(t-y) - 1 , s(t) = t.
The problem is rewritten by using the co-ordinate transformation
given by x(t) = y/s(t). the equations then read
.
s * s * U - s * s * x * U = U , x in (0,1)
t x xx
with the Neumann type boundary conditions
U = -exp(t) at x=0 and U = -s(t) * s(t) at x = 1
x x
and the O.D.E. coupling point equation at x = 1 which implicitly
defines s(t) is given by
U(1,t) = 0
The exact solution is now defined by
U(x,t) = exp((t - x*s(t)) , s(t) = t
CAVITY A variety of public domain codes are used to solve aspects of the Navier-Stokes equation in 2D.
CAUIE This program solves the singular Cauchy integral eqn :
1
/ u(x,y)
phi(x) = f(x) + alam* l dy -------- phi(y)
/ (x-y)
-1
phi(x) is the unknown function, f(x) and u(x,y) are user-specified functions, and alam is a specified constant. After paying special attention to the singularity, the integral equation was reduced to a system of linear algebraic equations. These are solved using standard LAPACK and BLAS routines. This automatically gives a highly optimized/parallelized program.
SCH3D This program solves for the 3 lowest eigenvalues/vectors of the 3D Schroedinger equation for an anisotropic harmonic oscillator. The equation is discretized using a finite-difference scheme, to give a large sparse matrix with N = 1030301.
It has become rather obvious to many people that Matlab is a very useful, productive working environment. For those who may not be familiar with Matlab as well as those who may not be aware of some recent developments, the following is a brief description of some of the reasons for using Matlab.
.m and .mex
interface files that allow you to call Fortran or C routines
directly in Matlab. This will save the great effort to convert Fortran
routines to Matlab, and vastly increase the availlable software.
See TaskVA for an example.
.mat files .
The above points will hopefully convince you that more and more of your numerical/symbolic/graphical computations may be done in one program - Matlab.
INDUSTRY HARNESSES CLUSTERED WORKSTATIONS TO SQUEEZE EXTRA CYCLES
Mar. 31 FEATURE by Elisabeth Wechsler,
NAS News HPCwire
******************************************************************* Three case studies of recent aeronautics industry experience show how workstation clusters are being used as alternative forms of high-performance computing.
McDonnell Douglas Adapts PVM To Meet CFD Production Code Challenges
On any given night or weekend, McDonnell Douglas has as many as 400 workstations (with an average of 200 per session) divided into clusters of 20 workstations per parallel job doing CFD (computational fluid dynamics) processing, according to Ray Cosner, CFD Applications manager at McDonnell Douglas Aerospace, St. Louis.
Using the public domain Parallel Virtual Machine (PVM) software as a foundation, his group implemented and adapted PVM's parallel processing capability into its production CFD code, NASTD, in five months. This McDonnell Douglas proprietary code functions similarly to multi-block structured NASA codes such as OVERFLOW, CFL3D and NPARC/PARC3D.
Cosner is pleased with the use of PVM-based parallel processing on workstation clusters for large-scale, time-critical CFD analyses. "Our experience has met or exceeded our expectations," he said. "By utilizing otherwise-wasted workstation computing cycles, we've expanded our useful in-house CFD computing capacity by perhaps 20 times."
A small portion of the jobs run on a heterogeneous collection of workstations and the remainder use Hewlett-Packard models of various types and sizes. However, in Cosner's opinion "heterogeneity with PVM implementations is a non-issue." The real question is one of network topology: "How many network links must be crossed when grouping workstations together to run a job?"
Two years ago, McDonnell Douglas began running selected jobs in parallel on workstation clusters while production jobs were running in single mode. At this point, some problems with fault tolerance were detected. "We believe our attention to this class of issues is one reason our experience has been so positive," Cosner said.
Whereas individual workstations and network elements may be 99 percent fault-free for a short job, a cluster of 20 workstations and 50 network elements offer "a very high probability of disruption" for a long-running parallel job, Cosner explained. "A straight installation without considering fault tolerance fails to address a parallel job's need to deal with network hiccups."
After the fault tolerance problem was solved, the workstation clusters were ready for "unrestricted production" in October 1993. Solutions are now checkpointed at user-controlled intervals and various tests are conducted to monitor the health of the network links and the processing nodes.
"If a fault is detected, the parallel system is reconfigured without the offending elements and the job is restarted from the last checkpoint," he said. "We've had jobs successfully reconfigure and restart as many as three times in one night."
These defensive measures have boosted McDonnell Douglas's single-job overnight success rate to about 95 percent. "This reliability is essential for user acceptance of workstation cluster computing," Cosner said. Without these measures, he estimates the success rate for a simple PVM implementation to be well below 50 percent.
Cosner noted that McDonnell Douglas has not yet tried to optimize operating systems, CFD algorithms, database design, or message composition to further increase efficiency in its workstation cluster environment, admitting that "this is an immediate research need."
However, Cosner is convinced that important improvements can still be made. For example, he noted that "the process of periodic solution checkpointing wastes computing cycles." He would like to see "the current 5 percent failure rate reduced substantially, with checkpointing overhead also reduced dramatically," and called for "an improved set of diagnostic tools to improve efficiency and detect failures affecting this type of processing."
Other areas where Cosner would like to see improvement are job scheduling and load balancing. "We've had chronic problems with PVM jobs (which run in the background) stalling because a higher priority process is running on the same node." He explained that this can lead to "all sorts of degradations, including CPU and memory conflicts and `page thrashing' which accounts for most of our current overnight job failures."
A related problem is "runaway" or "hung" processes (typically, non-CFD tasks such as system accounting processes) which are not detected until they "effectively kill a PVM job," he said.
The only corporate problem with the workstation cluster project was "convincing the rest of the company that hardware is not personal property." Because most of the competing work is related to interactive computer-aided design (CAD), restricting access to periods between 7:00 p.m. and 6:00 a.m. and on weekends "wasn't much of an imposition," he said.
However, if a CAD engineer came in early, that work would take priority. "This caused the most damaging scenario," Cosner said, adding, "If one of our jobs wouldn't shut down at the end of the night, this also created problems."
Cosner believes that the workstation cluster parallel computing at McDonnell Douglas has been "a highly positive experience providing supercomputer-equivalent cycles in a cost-constrained environment." He hopes "to build on this base for a longer-range evolution into more powerful parallel implementations."
Teamwork Plus 600 Off-duty Workstations Reduce Design Time, Cost at Pratt & Whitney
Pratt & Whitney (P&W) has achieved the throughput equivalent of 16 CRAY C90 CPUs by using networked workstations at two sites, according to Dan Minior, manager of Aerodynamic and Mechanical Integrity Systems for Management Information Systems.
Computing at P&W is driven by customer requirements to increase performance and the company's desire to reduce design time and cost. P&W's goal is to maximize asset utilization and Minior believes that loosely clustered workstations contribute to that goal.
By running jobs on parallel workstations during off hours to harvest cycles equivalent to approximately 75-90 CRAY X-MPs, P&W demonstrated this approach as a "viable alternative to investing in more costly supercomputers," Minior said. Just three years ago, the same approach delivered only 2-3 CRAY X-MP equivalent cycles, he added.
P&W began a transition from mainframes to workstations at two domestic sites (East Hartford and West Palm Beach) in 1988-89, and achieved its first major success with parallel workstations in 1992 when Sun SPARC 2s were used in a PW4000 fan redesign. "Required accuracy was achieved, lead time was reduced from 18 to 6 months and savings in excess of $20 million were realized from improved aerodynamic efficiency and reduced testing," Minior said.
In the last six months, P&W has completed the redesign of the high pressure compressor for the advanced version of the PW4000 engine. Efficiency was improved substantially without an adverse effect on stall margin. He predicted that over the life of an aircraft (approximately 20 years) the fleet of P&W's next generation engines "may obtain fuel savings of $1 billion."
To complete the redesign, P&W harnessed 600 workstations in East Hartford plus 300 in West Palm Beach during off hours. During the day, these workstations are used for CAD work which requires a high degree of interaction with engineers. As first-shift employees leave between 3:30 p.m. and 6:00 p.m., their workstations become part of a pool for calculations until 6:00 a.m. when they revert to daytime use. P&W also uses the workstation network on weekends, holidays and even during the annual facili
Results show that P&W has been able to establish a production environment with workstation clusters and Minior attributes much of the credit to "a high degree of teamwork." Eleven engineers representing networking, operating support, application developers and end users provide applications support to all of P&W.
One sub-team spent about a year on reliability issues -- addressing a variety of glitches, from problems with UNIX internals to mishaps caused by evening cleaning crews. As a result of this focus, Minior claims that P&W's reliability for its workstation clusters improved from 30 percent to 93 to 99 percent.
The P&W network consists mostly of Sun workstations of varying sizes and types, one operating system and a variety of software applications. P&W developed its own communication software PROWESS (Parallel Running of Workstations Employing Sockets).
P&W also uses Load Sharing Facility (LSF) software developed by Platform Computing, Toronto, to maintain "a stable production environment," Minior said. LSF helps with resource monitoring, system management and job scheduling. Presently, LSF works for serial batch only; however, Minior is working with Platform Computing to modify the software to handle parallel jobs.
Minior said that further work is needed on system software and applications to work with codes using network communications. "We haven't solved all the problems, but we see a great advantage to this alternative form of compute power," he said. "The future should continue to bring us high utilization of equipment and substantial achievement in compute power at low cost."
The next steps, as Minior sees it, are improved simulation software, faster workstations, faster networks and improved message-passing. "This will facilitate workstation clustering and will enable NASA and industry to work together to advance the technology, accelerate the pace of deployment and improve U.S. competitiveness," he said, adding that he appreciates the support he has received "from all the NASA sites, particularly Lewis Research Center."
`Deployable Software' Is the Key at Boeing
Tom Wicks, program manager of High Performance Computing at Boeing Computer Services, stated that his company's investigation of cluster computing started with an attempt to "run parallel and sequential production codes in a transparent manner." The biggest challenge this effort faces is "deployable software" that allows clustered workstations to do message passing and job scheduling.
Boeing's work which began in 1992, has focused on the issue of "when is the right time and what are the right applications to use." Wicks believes that cluster computing is "applications dependent" and that not every application is appropriate for this environment.
Boeing has been conducting a variety of cluster pilot tests including interdepartmental testing on two workstation clusters. One cluster, completed last year, involved 14-20 workstations dedicated to propulsion applications. The other cluster, involving 14 workstations for helicopter applications, is just beginning. Some of the codes have extensive communications requirements and others don't. Boeing's efforts have also addressed application codes involving structures, propulsion, aerodynamics and electr
"The idea is to run some applications on mixed clusters," he said. "For example, 15-20 workstations running structures' applications may not have enough cycles, so we may want to run these applications on another group's cluster across the country. In this situation, we want software with transparency, not only to our department, but to other work groups as well," Wicks said.
The Boeing tests address network communications issues as both a resource and a requirement for job scheduling purposes, Wicks said. The company is exploring a control system, using fuzzy logic -- a mathematical technique for dealing with imprecise data that is more analogous to human logic -- which prioritizes the allocation of parallel jobs to, and their suspension from, a cluster of networked workstations. The main objective of this work is to facilitate the harnessing of computational resources from
"If engineers are to rely on workstation clusters, then reliable job scheduling that ensures fault tolerance, interoperability of systems software and load balancing is a necessary requirement. At present, most job scheduling software available for the cluster environment cannot schedule a job based on network load," Wicks said, adding that "most of the job scheduling software takes into account the status of the CPUs, not the network."
"We'd like to submit the code through some mechanism to a job scheduler that will assign the most appropriate workstation cluster based on communications, CPU and disk requirements." The work may be done sequentially on the person's desk, a shared memory machine or maybe a workstation cluster in another department."
Wicks said Boeing has tried most of the major hardware vendors for workstations and supercomputers and wants the whole environment to be considered for deployment. "We've gone beyond proof of principle to: How do you deploy in a production paradigm?" He noted that issues such as system administration, security, network configuration, infrastructure and integration issues still need to be resolved.
Two years ago, a typical attitude at Boeing was, "my machine is my machine," Wicks said. "Now the bigger problem is reliability of workstation clusters: If you can't do your work because the software isn't working properly, you'll be reluctant to participate in a cluster project."
In conclusion, Wicks said, "We are cautious about the potential of workstation clusters. We're concerned about putting a production paradigm in place before the right software and support infrastructure is there to handle it and reliable, commercial software has not been as available as we'd like. We want off-the-shelf software products to ensure that they're properly supported"
Elisabeth Wechsler is a senior writer for NAS News, a bimonthly publication of the Numerical Aerodynamic Simulation Facility. This article appears courtesy of NAS News and the NAS Systems Division. ********************************************************************** HPCwire has released all copyright restrictions for this item. Please feel free to distribute this article to your friends and colleagues.