To quote Winston Churchill :
" Give us the tools
and we will finish the job "

If you are planning to do some numerical calculations, the following suggestions may help you to do your work :

Fortran as a Higher-level Language

Some Sample Problems Solved with Public Domain Software

MATLAB : The Rome that all roads lead to

Using Workstation Clusters to squeeze extra cycles





Fortran as a Higher-level Language

" Give us the tools ... "

By following the suggestions below, you should be able to greatly reduce the amount of coding you need to do, while ending up with more efficient and robust programs. By 'properly' breaking your problem into its logical constituent parts, e.g. :

you may be able to write your program by largely calling a series of high quality public domain routines and packages. Thus, by accessing the huge, and growing volume of high quality Fortran software in the public domain, you are essentially using Fortran as a higher level language.

     DO 50 j = 1,m,4
       DO 40 k = 1,n
         DO 30 i = 1,l
           c(i,k) = c(i,k) + a(i,j)*b(j,k) + 
    &        a(i,j+1)*b(j+1,k) + a(i,j+2)*b(j+2,k) + 
    &        a(i,j+3)*b(j+3,k)
30       CONTINUE
40     CONTINUE
50   CONTINUE 
 

As a general rule, it is a very good idea to make extensive use of LAPACK/BLAS when writing your program. Also, many general purpose codes now available in the public domain make extensive use of these routines.

In the latest release of LAPACK, the new code, DSYEVD, uses the 'divide-and-conquer' algorithm to calculate all the eigenvalues/vectors of real symmetric matrix. DSYEVD makes much greater use of the BLAS routines than the old DSYEV. In fact, by using prof, we find that 66% of the cpu time for DSYEVD was spent in the matrix-matrix multiplication routine DGEMM. DGEMM is available in highly optimized form on many computers such as the IBM RS/6000. On a model 375 system the cpu times are 231 and 77 sec for DSYEV and DSYEVD respectively. The improvement is significantly better than for the Sparc. Similar remarks hold true for programs using Fast Fourier Transforms, especially in 2 or 3 dimensions.

These ideas and routines will also be a good starting point if you want to parallelize your code. Some results for multiprocessor systems are available.

Here is a short list of some LAPACK / BLAS based Software They should be a good core for large scale calculations involving large vectors and matrices. You should use the -p option for f77 to profile your program to see if most of the cpu time is indeed spent in the BLAS routines.


Some Sample Problems Solved with Public Domain Software

" ... and we will finish the job "

NSUVP : A general purpose FEM code (FELIB) is used to solve the Navier-Stokes Equations using a velocity/pressure formulation on mixed elements.

WEBILU : A general purpose code for differential-algebraic equations solves a system that arises from a food web population model, with predator-prey interaction and diffusion on the unit square in two dimensions.

RADAR5 : A general purpose code for delay differential equations

              is used to solve the biological problems :
              - acute hepatitis B virus infection problem.
              - 4D enzyme kinetics with an inhibitor molecule
              - threshold model for antibody production

DK64 : A general purpose DDE solver (DKLAG6) is used for the Solution of the MacKey-Glass Equation for Chaotic Attractors of an Infinite-Dimensional Dynamical System,

VORTEX : A program to simulate laminar flow about a cylinder which may be circular or of limited general shape (without sharp corners)

RNRTST : A test program for the multivariate integration routine RNRTMX which is used to compute expected values for each of the variables for several statistics problems, including :

   - Birthweight Data
   - Bio Oxygen Data
   - Contingency Table Data
   - Stanford Heart Transplant Data
   - Econometric Data
   - Lubricant Data
   - Multivariate Logistic Distribution
   - Multivariate Normal Distribution
   - Photocarcinogen Data
   - Radiotherapy Data
   - Tornado Data
Click here for other multivariate integration programs and reference material.

CSE2DD : A program to solve the time-independent cubic Schroedinger eqn :


      LAP(U) = -EIG*(U-U^3) : LAP is the 2D Laplacian operator

 in the unit square, with U = 0 on the boundary.
 Nontrivial solutions bifurcate from the trivial solutions only at the
 eigenvalues of the linearized problem :

      EIG = EIG0 = (m^2+n^2)*Pi^2 ;  m,n > 0, integer

 The eigenfunctions of the linearized problem are :

      U(x,y) = U0(x,y) = sin(m*Pi*x)*sin(n*Pi*y)

VARAH : A spline least squares method for numerical parameter estimation in differential equations is used to fit the reaction coefficients in the thermal isomerization of alpha-pinene The 5 differentail equations are describing this process are:

       y'(1) = - (p(1)+p(2))*y(1)
       y'(2) = p(1)*y(1)
       y'(3) = p(2)*y(1) - (p(3)+p(4))*y(3) + p(5)*y(5)
       y'(4) = p(3)*y(3)
       y'(5) = p(4)*y(3) - p(5)*y(5)

       where p(*) are the 5 parameters which are to be determined

BURGER : A collection of codes for solving the Burger equation, which has the form :

    U  + U*U  - vis*U   = 0
     t      x        xx
where vis is the viscosity.

CONDIT This program solves the system of linear equations from the discretized convection-diffusion problem :


    U   + U   + cU + dU  = f
     xx    yy          x

on [0,1]x[0,1] with dirichlet boundary conditions. Discretization is by centered differences. As n = nx**2 increases, the problem becomes increasingly difficult to solve without preconditioning. As c and d increase, the preconditioner becomes increasingly less effective. This problem was solved using a bicgstab iterative linear solver for the case c = 10, d = 100, nx = 999, n = 998001 on a HyperSparc. When no preconditioner was used, the cpu time required was 10672 sec. When a Fast Poisson solver was used as a preconditioner, the cpu time required was 1951 sec. Besides this dramatic reduction in cpu time, there is the possibility of a further reduction since about 60% of the cpu time for the preconditioned job was taken by the Fast Poisson solver which uses 2D FFTs. Optimized/parallelized versions of FFTs are available for many computers.

CAVPIT This program solves the steady-state Navier-Stokes eqn in 2D for the flow in a driven cavity. The function solved for is the streamfunction. The vorticity may be obtained by differentiating the streamfunction. The solver used is PITCON. It will solve for a series of values of the Reynolds number.

ASTMHD This program solves the Sturm-Liouville eigenvalue equation :

     (p(x)*y'(x))' + (q(x) + eig*r(x))*y(x) = 0

on (a,b) for an Astrophysics MagnetoHydrodynamics problem with :

     p = x
     q = -x* (k^2+1/x^2)
     r = 1/x^3
     (a,b) = (1,20)

CSE This program solves the time dependent cubic Schroedinger eqn. :


       i*U + U  + q*(abs(U)^2)*U = 0
          t   xx

The solution to this problem is the soliton :

     U(x,t) = sqrt(2)*exp(i*(0.5*x + 0.75*t))*sech(x-t)

DISEASE This program solves the coupled nonlinear Volterra integral equations of the second kind

                   t
                    /                                                
      y(t) = g(t) + | k(t,s,y(s)) ds,   t in [t0,te].               
                    /                                                
                    t0
The present problem is taken from : Hethcote H. W. and Tudor D. W.; Integral Equation Models for Endemic Infectious Diseases, J. Math. Bio., vol. 9, page 37, 1980
  [t0, te] = [0, 50]

  g(t)     = (     a^21/ 100       )
             (1+(10-a^20).a) / 100 )

  k(t,s,y) = (    b^21          0   ) ( 3.y(1).(1-y(1)-y(2)) )
             ( (1-b^20).b    b/1000 ) (     1-y(1)-y(2)      )

  where a = exp(-t/20)  and  b = exp((s-t)/20)

SCTPWR As part of a Medical Physics project to solve the Boltzmann equation for high energy electrons, the program SCTPWR was written to evaluate the Hankel transform :

            pi
            /
     I(q) = | f(x) J0(qx) dx
            / 
            0

The function f(x) varies very rapidly, going from 0 -> 10^6 as x goes
from 0 -> Pi/10000. After that, f(x) drops off to 1/10^6 at x = Pi. This sharp behavior in f(x) leads to I(q) being very spread out - values as high as q = 20000 needed to be calculated before I(q) had decreased sufficiently. The public domain code HANKEL was satisfactorily able to perform these calculations because it uses a sophisticated adaptive strategy.

MOVEBD This program solves the moving boundary (Stefan) problem. The problem has melting due to heat input at the fixed boundary . The P.D.E. is defined by the equations

      U = U        0 < y < s(t), 0.1 < t < 1
       t   yy

      U = -exp(t) , y = 0
       y

      U = 0,  s(t) = -U
                       y
on the moving boundary y = s(t), and the initial solution values at t = 0.1 are given by the analytic solution U = exp(t-y) - 1 , s(t) = t. The problem is rewritten by using the co-ordinate transformation given by x(t) = y/s(t). the equations then read
                      .
     s * s * U - s * s * x * U =  U , x in (0,1)
              t               x    xx
with the Neumann type boundary conditions
                                      
     U = -exp(t) at x=0  and U = -s(t) * s(t) at x = 1
      x                       x
and the O.D.E. coupling point equation at x = 1 which implicitly defines s(t) is given by
     U(1,t) = 0
The exact solution is now defined by
     U(x,t) = exp((t - x*s(t)) , s(t) = t

CAVITY A variety of public domain codes are used to solve aspects of the Navier-Stokes equation in 2D.

CAUIE This program solves the singular Cauchy integral eqn :



                           1
                           /     u(x,y)
     phi(x) = f(x) + alam* l dy -------- phi(y)
                           /     (x-y)
                          -1

phi(x) is the unknown function, f(x) and u(x,y) are user-specified functions, and alam is a specified constant. After paying special attention to the singularity, the integral equation was reduced to a system of linear algebraic equations. These are solved using standard LAPACK and BLAS routines. This automatically gives a highly optimized/parallelized program.

SCH3D This program solves for the 3 lowest eigenvalues/vectors of the 3D Schroedinger equation for an anisotropic harmonic oscillator. The equation is discretized using a finite-difference scheme, to give a large sparse matrix with N = 1030301.


MATLAB : The Rome that all roads lead to.

It has become rather obvious to many people that Matlab is a very useful, productive working environment. For those who may not be familiar with Matlab as well as those who may not be aware of some recent developments, the following is a brief description of some of the reasons for using Matlab.

  1. It is a very powerful, friendly language.
  2. It has a large, and growing number of built-in routines.
  3. It offers sophisticated graphics facilities : line plots, contour plots, surface plots. You can also make movies of your output. The MatlabSockets package allows one to view the graphical output while your Fortran or C program is executing.
  4. There are many Matlab based books. Some of these include software.
  5. There is a growing volume of Matlab application software available in the public-domain, over the Internet. Note particularly the C++, Fortran, and Matlab Software from the book "Numerical Methods for Physics" by Dr. A. Garcia
  6. The Symbolic Math Toolboxes allow many of the Maple symbolic functions to be called directly in Matlab. It is also possible to access Matlab from Maple. For more information, startup Maple and enter ?matlab
  7. NAG has a Matlab Gateway Generator. This will make it possible to automatically generate .m and .mex interface files that allow you to call Fortran or C routines directly in Matlab. This will save the great effort to convert Fortran routines to Matlab, and vastly increase the availlable software. See TaskVA for an example.
  8. The Matlab 'Engine' allows you to access Matlab from a Fortran or C program. It is also possible to write out the Matlab binary .mat files .
  9. If you would like to check out some of the similarities/differences between Fortran 77, Fortran 90 and Matlab, click here for a simple Monte Carlo integration program.
  10. Matlab now has a compiler (mcc) which allows you to convert Matlab code to C code. This should be particularly useful for Matlab code that is particularly slow, or to prototype code that you want to include in a C program.
  11. BEWARE : Some 'for' loops in Matlab can be MUCH slower than the corresponding DO loos in Fortran. For a simple example dealing with chaotic motion, Click Here. Note that the Matlab Compiler (see above) can often convert a slow .m file into a faster .mex file.
  12. f77toM : A Perl 5 script to convert one or more f77 or f90 files toMatlab M-files.
  13. Parallel Software and Matlab
  14. MATLAB Incorporates LAPACK : Increasing the speed and capabilities of matrix computation
  15. MISC : Some miscellaneous Matlab software.

The above points will hopefully convince you that more and more of your numerical/symbolic/graphical computations may be done in one program - Matlab.


Using Workstation Clusters to squeeze extra cycles

INDUSTRY HARNESSES CLUSTERED WORKSTATIONS TO SQUEEZE EXTRA CYCLES
Mar. 31 FEATURE by Elisabeth Wechsler,
NAS News HPCwire
******************************************************************* Three case studies of recent aeronautics industry experience show how workstation clusters are being used as alternative forms of high-performance computing.


McDonnell Douglas Adapts PVM To Meet CFD Production Code Challenges

On any given night or weekend, McDonnell Douglas has as many as 400 workstations (with an average of 200 per session) divided into clusters of 20 workstations per parallel job doing CFD (computational fluid dynamics) processing, according to Ray Cosner, CFD Applications manager at McDonnell Douglas Aerospace, St. Louis.

Using the public domain Parallel Virtual Machine (PVM) software as a foundation, his group implemented and adapted PVM's parallel processing capability into its production CFD code, NASTD, in five months. This McDonnell Douglas proprietary code functions similarly to multi-block structured NASA codes such as OVERFLOW, CFL3D and NPARC/PARC3D.

Cosner is pleased with the use of PVM-based parallel processing on workstation clusters for large-scale, time-critical CFD analyses. "Our experience has met or exceeded our expectations," he said. "By utilizing otherwise-wasted workstation computing cycles, we've expanded our useful in-house CFD computing capacity by perhaps 20 times."

A small portion of the jobs run on a heterogeneous collection of workstations and the remainder use Hewlett-Packard models of various types and sizes. However, in Cosner's opinion "heterogeneity with PVM implementations is a non-issue." The real question is one of network topology: "How many network links must be crossed when grouping workstations together to run a job?"

Two years ago, McDonnell Douglas began running selected jobs in parallel on workstation clusters while production jobs were running in single mode. At this point, some problems with fault tolerance were detected. "We believe our attention to this class of issues is one reason our experience has been so positive," Cosner said.

Whereas individual workstations and network elements may be 99 percent fault-free for a short job, a cluster of 20 workstations and 50 network elements offer "a very high probability of disruption" for a long-running parallel job, Cosner explained. "A straight installation without considering fault tolerance fails to address a parallel job's need to deal with network hiccups."

After the fault tolerance problem was solved, the workstation clusters were ready for "unrestricted production" in October 1993. Solutions are now checkpointed at user-controlled intervals and various tests are conducted to monitor the health of the network links and the processing nodes.

"If a fault is detected, the parallel system is reconfigured without the offending elements and the job is restarted from the last checkpoint," he said. "We've had jobs successfully reconfigure and restart as many as three times in one night."

These defensive measures have boosted McDonnell Douglas's single-job overnight success rate to about 95 percent. "This reliability is essential for user acceptance of workstation cluster computing," Cosner said. Without these measures, he estimates the success rate for a simple PVM implementation to be well below 50 percent.

Cosner noted that McDonnell Douglas has not yet tried to optimize operating systems, CFD algorithms, database design, or message composition to further increase efficiency in its workstation cluster environment, admitting that "this is an immediate research need."

However, Cosner is convinced that important improvements can still be made. For example, he noted that "the process of periodic solution checkpointing wastes computing cycles." He would like to see "the current 5 percent failure rate reduced substantially, with checkpointing overhead also reduced dramatically," and called for "an improved set of diagnostic tools to improve efficiency and detect failures affecting this type of processing."

Other areas where Cosner would like to see improvement are job scheduling and load balancing. "We've had chronic problems with PVM jobs (which run in the background) stalling because a higher priority process is running on the same node." He explained that this can lead to "all sorts of degradations, including CPU and memory conflicts and `page thrashing' which accounts for most of our current overnight job failures."

A related problem is "runaway" or "hung" processes (typically, non-CFD tasks such as system accounting processes) which are not detected until they "effectively kill a PVM job," he said.

The only corporate problem with the workstation cluster project was "convincing the rest of the company that hardware is not personal property." Because most of the competing work is related to interactive computer-aided design (CAD), restricting access to periods between 7:00 p.m. and 6:00 a.m. and on weekends "wasn't much of an imposition," he said.

However, if a CAD engineer came in early, that work would take priority. "This caused the most damaging scenario," Cosner said, adding, "If one of our jobs wouldn't shut down at the end of the night, this also created problems."

Cosner believes that the workstation cluster parallel computing at McDonnell Douglas has been "a highly positive experience providing supercomputer-equivalent cycles in a cost-constrained environment." He hopes "to build on this base for a longer-range evolution into more powerful parallel implementations."


Teamwork Plus 600 Off-duty Workstations Reduce Design Time, Cost at Pratt & Whitney

Pratt & Whitney (P&W) has achieved the throughput equivalent of 16 CRAY C90 CPUs by using networked workstations at two sites, according to Dan Minior, manager of Aerodynamic and Mechanical Integrity Systems for Management Information Systems.

Computing at P&W is driven by customer requirements to increase performance and the company's desire to reduce design time and cost. P&W's goal is to maximize asset utilization and Minior believes that loosely clustered workstations contribute to that goal.

By running jobs on parallel workstations during off hours to harvest cycles equivalent to approximately 75-90 CRAY X-MPs, P&W demonstrated this approach as a "viable alternative to investing in more costly supercomputers," Minior said. Just three years ago, the same approach delivered only 2-3 CRAY X-MP equivalent cycles, he added.

P&W began a transition from mainframes to workstations at two domestic sites (East Hartford and West Palm Beach) in 1988-89, and achieved its first major success with parallel workstations in 1992 when Sun SPARC 2s were used in a PW4000 fan redesign. "Required accuracy was achieved, lead time was reduced from 18 to 6 months and savings in excess of $20 million were realized from improved aerodynamic efficiency and reduced testing," Minior said.

In the last six months, P&W has completed the redesign of the high pressure compressor for the advanced version of the PW4000 engine. Efficiency was improved substantially without an adverse effect on stall margin. He predicted that over the life of an aircraft (approximately 20 years) the fleet of P&W's next generation engines "may obtain fuel savings of $1 billion."

To complete the redesign, P&W harnessed 600 workstations in East Hartford plus 300 in West Palm Beach during off hours. During the day, these workstations are used for CAD work which requires a high degree of interaction with engineers. As first-shift employees leave between 3:30 p.m. and 6:00 p.m., their workstations become part of a pool for calculations until 6:00 a.m. when they revert to daytime use. P&W also uses the workstation network on weekends, holidays and even during the annual facili

Results show that P&W has been able to establish a production environment with workstation clusters and Minior attributes much of the credit to "a high degree of teamwork." Eleven engineers representing networking, operating support, application developers and end users provide applications support to all of P&W.

One sub-team spent about a year on reliability issues -- addressing a variety of glitches, from problems with UNIX internals to mishaps caused by evening cleaning crews. As a result of this focus, Minior claims that P&W's reliability for its workstation clusters improved from 30 percent to 93 to 99 percent.

The P&W network consists mostly of Sun workstations of varying sizes and types, one operating system and a variety of software applications. P&W developed its own communication software PROWESS (Parallel Running of Workstations Employing Sockets).

P&W also uses Load Sharing Facility (LSF) software developed by Platform Computing, Toronto, to maintain "a stable production environment," Minior said. LSF helps with resource monitoring, system management and job scheduling. Presently, LSF works for serial batch only; however, Minior is working with Platform Computing to modify the software to handle parallel jobs.

Minior said that further work is needed on system software and applications to work with codes using network communications. "We haven't solved all the problems, but we see a great advantage to this alternative form of compute power," he said. "The future should continue to bring us high utilization of equipment and substantial achievement in compute power at low cost."

The next steps, as Minior sees it, are improved simulation software, faster workstations, faster networks and improved message-passing. "This will facilitate workstation clustering and will enable NASA and industry to work together to advance the technology, accelerate the pace of deployment and improve U.S. competitiveness," he said, adding that he appreciates the support he has received "from all the NASA sites, particularly Lewis Research Center."


`Deployable Software' Is the Key at Boeing

Tom Wicks, program manager of High Performance Computing at Boeing Computer Services, stated that his company's investigation of cluster computing started with an attempt to "run parallel and sequential production codes in a transparent manner." The biggest challenge this effort faces is "deployable software" that allows clustered workstations to do message passing and job scheduling.

Boeing's work which began in 1992, has focused on the issue of "when is the right time and what are the right applications to use." Wicks believes that cluster computing is "applications dependent" and that not every application is appropriate for this environment.

Boeing has been conducting a variety of cluster pilot tests including interdepartmental testing on two workstation clusters. One cluster, completed last year, involved 14-20 workstations dedicated to propulsion applications. The other cluster, involving 14 workstations for helicopter applications, is just beginning. Some of the codes have extensive communications requirements and others don't. Boeing's efforts have also addressed application codes involving structures, propulsion, aerodynamics and electr

"The idea is to run some applications on mixed clusters," he said. "For example, 15-20 workstations running structures' applications may not have enough cycles, so we may want to run these applications on another group's cluster across the country. In this situation, we want software with transparency, not only to our department, but to other work groups as well," Wicks said.

The Boeing tests address network communications issues as both a resource and a requirement for job scheduling purposes, Wicks said. The company is exploring a control system, using fuzzy logic -- a mathematical technique for dealing with imprecise data that is more analogous to human logic -- which prioritizes the allocation of parallel jobs to, and their suspension from, a cluster of networked workstations. The main objective of this work is to facilitate the harnessing of computational resources from

"If engineers are to rely on workstation clusters, then reliable job scheduling that ensures fault tolerance, interoperability of systems software and load balancing is a necessary requirement. At present, most job scheduling software available for the cluster environment cannot schedule a job based on network load," Wicks said, adding that "most of the job scheduling software takes into account the status of the CPUs, not the network."

"We'd like to submit the code through some mechanism to a job scheduler that will assign the most appropriate workstation cluster based on communications, CPU and disk requirements." The work may be done sequentially on the person's desk, a shared memory machine or maybe a workstation cluster in another department."

Wicks said Boeing has tried most of the major hardware vendors for workstations and supercomputers and wants the whole environment to be considered for deployment. "We've gone beyond proof of principle to: How do you deploy in a production paradigm?" He noted that issues such as system administration, security, network configuration, infrastructure and integration issues still need to be resolved.

Two years ago, a typical attitude at Boeing was, "my machine is my machine," Wicks said. "Now the bigger problem is reliability of workstation clusters: If you can't do your work because the software isn't working properly, you'll be reluctant to participate in a cluster project."

In conclusion, Wicks said, "We are cautious about the potential of workstation clusters. We're concerned about putting a production paradigm in place before the right software and support infrastructure is there to handle it and reliable, commercial software has not been as available as we'd like. We want off-the-shelf software products to ensure that they're properly supported"


Elisabeth Wechsler is a senior writer for NAS News, a bimonthly publication of the Numerical Aerodynamic Simulation Facility. This article appears courtesy of NAS News and the NAS Systems Division. ********************************************************************** HPCwire has released all copyright restrictions for this item. Please feel free to distribute this article to your friends and colleagues.