Parallel processing

oomph-lib is designed so that, provided the library is compiled with MPI support (discussed below in Basic parallel usage), many of the most computationally-expensive phases of a typical computation are automatically performed in parallel. Examples of "automatically parallelised" tasks include the assembly of the distributed Jacobian matrix and residual vector, and the solution of the resulting linear systems by oomph-lib's parallel linear solvers.

The only parallel task that requires user intervention is the distribution of a problem over multiple processors so that each processor stores a subset of the elements. For straightforward problems, a single call to Problem::distribute() suffices. Furthermore, the majority of oomph-lib's multi-physics helper functions (e.g. the automatic setup of the fluid load on solid elements in FSI problems; the determination of "source elements" in multi-field problems; etc.) can be used in distributed problems. For less-straightforward problems, the user may have to intervene in the distribution process and/or be aware of the consequences of the distribution. Hence, the section Distribution of problems by domain decomposition provides an overview of the underlying design used for problem distribution within oomph-lib. A number of demo driver codes for distributed problems are provided, and any additional issues are discussed in the accompanying tutorials.

Basic parallel usage

How to build/install oomph-lib with MPI support

How to run a driver code in parallel

oomph-lib's parallel linear solvers

How to include parallel demo codes into the self-tests

Distribution of problems by domain decomposition

By default, each processor stores the entire Problem object, so all data is available on all processors. As a result, the size of the problem is limited by the smallest amount of memory available on any of the processors. In addition, the mesh adaptation does not benefit from parallel processing because each processor must adapt its own copy of the entire mesh, even though it operates on only a subset of the elements when assembling the Jacobian matrix.

To address this problem, oomph-lib's domain decomposition procedures allow a Problem to be distributed over multiple processors so that each processor holds a fraction of the Problem's elements, which can lead to substantial reductions in memory usage per processor and allows the mesh adaptation to be performed in parallel.

Basic usage: How to distribute a simple problem

Overview of the implementation of Problem distribution

The main task of the Problem::distribute() function is to distribute the Problem's global mesh (possibly comprising multiple sub-meshes) amongst the processors so that (a) the storage requirements on each processor are reduced; (b) during the mesh adaptation each processor only acts on a fraction of the overall mesh; while ensuring that (c) communication between processors required to synchronise any shared data structures is minimised.

The figure below shows a conceptual sketch of the parallelisation strategy adopted. For simplicity, we shall restrict the discussion to the 2D case and ignore various complications that arise with more complicated mesh partitionings.

The initial distribution of a problem proceeds in two stages: first, the global mesh is partitioned so that each element is associated with a unique processor; second, each processor deletes the elements it is not in charge of, retaining only a single layer of adjacent "halo" elements, copies of elements (and their nodes) whose "haloed" originals are held on neighbouring processors.

Sketch illustrating the phases of the parallel mesh adaptation procedure for a problem that is distributed over four processors. The columns illustrate the evolution of the mesh on each of the four processors. The colours of the objects indicate which processor is associated with them.

Aside from the initial refinement process, the functionality described above is implemented in a single function, Problem::distribute(). Following its execution on all processors, each processor can assemble its contribution to the distributed Jacobian matrix and residual vector, required by oomph-lib's parallel linear solvers, using only locally stored non-halo objects. Once the Newton correction to the unknowns has been computed, each processor updates the unknowns associated with its elements and nodes, before MPI-based communication is employed to update the unknowns stored at the processors' halo nodes.

After a problem has been distributed, further mesh refinement can be performed in parallel using the existing mesh adaptation procedures on the (partial) meshes held on the different processors. For spatially non-uniform refinement, each haloed element communicates whether or not it is to be refined to its halo counterparts before the adaptation takes place. Any nodes created during the refinement are associated with a unique processor, using the rules described above, and halo[ed] nodes are identified and added to the appropriate lookup schemes. These steps are performed automatically when Problem::refine_uniformly() or any of the other mesh adaptation routines within oomph-lib are executed.

Optional pruning of superfluous halo[ed] nodes and elements

The parallel efficiency of the distributed mesh adaptation (in terms of the memory required to hold the partial meshes, and in terms of the CPU time required for their adaptation) is limited by the fact that each processor must adapt not only the $N_{\rm in \ charge}$ elements it is in charge of, but also its $N_{\rm halo}$ halo elements. We define the efficiency of the problem distribution as

\[ e_{dist} = \frac{N_{\rm in \ charge}}{N_{\rm halo} + N_{\rm in \ charge}} \le 1, \]

where the equality could only be achieved in the absence of any halo elements.

When the mesh is first distributed, the halo layer has a depth of one element, but repeated mesh refinement can make the halo layers (the original halo elements and their sons) much thicker than a single-element layer. Thus, to a large extent, the efficiency is determined during the initial problem distribution; at that stage it can only be improved by (i) increasing the number of non-distributed initial mesh refinements, or (ii) reducing the number of processors. Since both options reduce the parallelism, neither is desirable. It is possible, however, to improve the parallel efficiency by pruning superfluous halo[ed] elements after each mesh refinement by calling the function

Problem::prune_halo_elements_and_nodes()

as illustrated in the figure. If this is done after every mesh adaptation $ e_{dist}$ increases significantly as the refinement proceeds. However, the pruning of halo[ed] nodes and elements makes the refinement irreversible and the mesh(es) involved can no longer be unrefined below the previous highest level of uniform refinement.

Customising the distribution

The procedures described above are completely sufficient for straightforward problems, e.g. a single mesh containing single-physics elements. For less-straightforward problems, e.g. those that involve interactions between multiple meshes, the interactions must be set up both before and after the distribution. The functions

Problem::actions_before_distribute()

Problem::actions_after_distribute()

can be used to perform any additional commands required to complete the setup of the problem after distribution. In many cases, these functions will contain the same commands as those required in the equivalent Problem::actions_before_adapt() and Problem::actions_after_adapt() functions used during mesh adaptation.

Distributing problems involving FaceElements

FaceElements are typically used to apply Neumann/traction-type boundary conditions; see the tutorials that discuss the application of such boundary conditions in the Poisson or Navier-Stokes equations. Since the FaceElements that apply the Neumann boundary conditions are attached to "bulk" elements that may disappear during mesh adaptation, we generally recommend storing the (pointers to the) FaceElements in a separate mesh, and using the Problem::actions_before_adapt() and Problem::actions_after_adapt() functions to detach and re-attach the FaceElements from/to the bulk elements before and after the mesh adaptation.

The same issues arise during the problem distribution: a FaceElement that was created before the problem was distributed may have been attached to a bulk element that is deleted when the distribution is performed, with obvious (and disastrous) consequences. We therefore recommend using the functions

Problem::actions_before_distribute()

Problem::actions_after_distribute()

to detach and re-attach any FaceElements before and after the problem distribution. In this context it is important to note that:

  1. The FaceElements should be available before Problem::distribute() is called to allow the load-balancing routines to take their presence into account.

  2. FaceElements that are attached to halo (bulk-)elements become halo-elements themselves.

Further details are provided in another tutorial which explains the modifications to the serial driver code required to distribute a Poisson problem with Neumann boundary conditions.

Distributing multi-domain problems

Multi-domain problems involve interactions between PDEs that are defined in different domains, such as fluid-structure interaction problems. Within oomph-lib, multi-domain problems typically involve elements, derived from the ElementWithExternalElement class, that store pointers to any "external" elements that take part in the interaction. These "external" elements are determined by helper functions such as FSI_functions::setup_fluid_load_info_for_solid_elements(...) or Multi_domain_functions::setup_multi_domain_interactions(...). The appropriate helper functions must be called in the function Problem::actions_after_distribute() to ensure that the interactions are correctly set up once the problem has been distributed.

The helper function Multi_domain_functions::locate_external_elements() has been written to work even after a problem has been distributed and uses the following algorithm:

  1. Loop over all (non-halo) ElementWithExternalElements and try to locate the "external" elements (e.g. fluid elements adjacent to an elastic wall in a fluid-structure interaction problem) on the current processor. If the required "external" element is found locally, the ElementWithExternalElement stores a pointer to it.

  2. If the "external" element cannot be found locally, MPI-based communication is employed to find the "external" element on one of the other processors. Once found, a halo-copy of the "external" element (and its nodes) is made on the current processor and a pointer to the halo-element is stored. These "external" halo elements and nodes are stored in the appropriate mesh, i.e. in an FSI problem, the "external" fluid elements are added to the fluid mesh.

"External" halo[ed] elements are automatically included in any halo/haloed synchronisation operations performed when assigning equation numbers, or updating unknowns during the Newton iteration, etc.

The procedure discussed above has the following important consequence:

Distributing problems involving meshes with algebraic node updates

oomph-lib provides a variety of algebraic node-update methods. These allow the fast and sparse update of the nodal positions in response to changes in the domain boundaries. The shape and position of such boundaries is typically represented by one or more GeomObjects. If the motion of the boundary is prescribed (as in the case of the flow inside an oscillating ellipse, say), no modifications are required when the meshes are used in a distributed problem.

In order to minimise communication, the design decision was taken that any GeomObjects defining the position of domain boundaries must be available on all processors after the problem is distributed. Thus, if the GeomObject is actually a MeshAsGeomObject, a compound GeomObject formed from a mesh of FiniteElements, then all the elements in the mesh, or all elements required to construct the mesh, must be retained as halo elements on every processor. This leads to a slight increase in the overall storage requirements (because none of the elements involved in the interaction are deleted when the problem is distributed) but it means that the entire GeomObject remains accessible to the fluid mesh without invoking MPI communications. Two functions can be used to specify that elements must be retained:


Mesh::keep_all_elements_as_halos()

keeps every element in the Mesh available to every processor, and


GeneralisedElement::set_must_be_kept_as_halo()

can be called for a particular element to ensure that it is kept available to every processor.

We stress that the increase in storage requirements due to the retention of these elements is minimal because the elements are only located along the (lower-dimensional) boundaries of the domain. For instance, in the collapsible channel problem the 1D mesh of beam elements bounds the 2D mesh of fluid elements; in Turek and Hron's FSI benchmark problem, the 2D fluid domain is bounded by a 1D mesh of FSISolidTractionElements, and so on.

Examples of the implementation of these ideas are given for the flow past an elastic leaflet and Turek and Hron's FSI benchmark problem.

Further MPI Details

Trouble-shooting and debugging

Debugging and documenting the distribution

Once a problem has been distributed, the function


Problem::check_halo_schemes()

can be called to check that the halo lookup schemes for each mesh are set up correctly.

Details about the mesh distribution can be generated by calling

Mesh::doc_mesh_distribution(DocInfo& doc_info)

which outputs the elements, nodes, halo(ed) elements, halo(ed) nodes, mesh, boundary elements and boundary nodes on each processor. This routine is automatically called when Problem::distribute() is called with a DocInfo object whose Doc_flag is set to true (the default behaviour).

Debugging parallel code

Parallel code can obviously fail in many more ways than a code that runs on a single processor. Here is a procedure that allows basic parallel debugging without requiring access to expensive commercial tools such as totalview, say. (The instructions below assume that you use LAM as your MPI installation; they can probably be modified to work with other versions of MPI, too).

Let's assume you use gdb as your debugger. To debug a serial code with gdb you would load the executable a.out into the debugger using the command

gdb ./a.out

on the command line. Once inside gdb, you run the code by typing "run". If the code crashes, typing "where" will tell you in which line of the code the crash occurred, and it will also provide a traceback of the function calls that got you to this point.

To do this in parallel, we have to run each (parallel) instance of the code within its own gdb session. To achieve this, create the following three files:

Then, to run the debugger in parallel on 3 processors for the executable a.out, the command would be

mpidbg -np 3 ./a.out

Once the command is issued, an xterm window will be opened for each processor, and if a crash occurs on any processor, the usual gdb commands (back, up, down, quit and so on) may be used within any of the xterm sessions where a crash has taken place.

PDF file

A pdf version of this document is available.