We conclude this chapter by presenting four examples of parallel algorithms. We do not concern ourselves here with the process by which these algorithms are derived or with their efficiency; these issues are discussed in Chapters 2 and 3, respectively. The goal is simply to introduce parallel algorithms and their description in terms of tasks and channels.
The first two algorithms described have an SPMD structure, the third creates tasks dynamically during program execution, and the fourth uses a fixed number of tasks but has different tasks perform different functions.
Figure 1.11: A parallel algorithm for the one-dimensional finite difference problem. From top to bottom: the one-dimensional vector X, where N=8; the task structure, showing the 8 tasks, each encapsulating a single data value and connected to left and right neighbors via channels; and the structure of a single task, showing its two inports and outports.
We first consider a one-dimensional finite difference problem, in which we have a vector $X^{(0)}$ of size $N$ and must compute $X^{(T)}$, where

$$X_i^{(t+1)} = \frac{X_{i-1}^{(t)} + 2X_i^{(t)} + X_{i+1}^{(t)}}{4}, \qquad 0 < i < N-1,\ 0 \le t < T.$$

That is, we must repeatedly update each element of $X$, with no element being updated in step $t+1$ until its neighbors have been updated in step $t$.
A parallel algorithm for this problem creates $N$ tasks, one for each point in $X$. The $i$th task is given the value $X_i^{(0)}$ and is responsible for computing, in $T$ steps, the values $X_i^{(1)}, X_i^{(2)}, \ldots, X_i^{(T)}$. Hence, at step $t$, it must obtain the values $X_{i-1}^{(t)}$ and $X_{i+1}^{(t)}$ from tasks $i-1$ and $i+1$. We specify this data transfer by defining channels that link each task with ``left'' and ``right'' neighbors, as shown in Figure 1.11, and requiring that at step $t$, each task $i$ other than task 0 and task $N-1$

1. sends its data $X_i^{(t)}$ on its left and right outports,
2. receives $X_{i-1}^{(t)}$ and $X_{i+1}^{(t)}$ from its left and right inports, and
3. uses these values to compute $X_i^{(t+1)}$.
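To make the task/channel structure concrete, here is a minimal sketch in Go, whose goroutines and channels map naturally onto tasks and channels. This illustration is ours, not the book's notation: for simplicity the two boundary tasks hold fixed values, and every channel is given a one-element buffer so that the sends in step 1 never block.

\begin{verbatim}
package main

import "fmt"

const (
    n     = 8 // vector length, as in Figure 1.11
    steps = 4 // number of time steps T
)

// interiorTask is task i, 0 < i < N-1: at each step it sends its value
// on both outports, receives its neighbors' values from its inports,
// and computes its value for step t+1.
func interiorTask(x float64, toLeft, toRight chan<- float64,
    fromLeft, fromRight <-chan float64, result chan<- float64) {
    for t := 0; t < steps; t++ {
        toLeft <- x // send X_i^(t) on both outports
        toRight <- x
        l, r := <-fromLeft, <-fromRight // receive X_{i-1}^(t), X_{i+1}^(t)
        x = (l + 2*x + r) / 4           // compute X_i^(t+1)
    }
    result <- x
}

// boundaryTask holds a fixed boundary value: it feeds its one neighbor
// and discards the values it receives.
func boundaryTask(x float64, out chan<- float64, in <-chan float64,
    result chan<- float64) {
    for t := 0; t < steps; t++ {
        out <- x
        <-in
    }
    result <- x
}

func main() {
    x := []float64{0, 1, 2, 3, 4, 5, 6, 7} // the initial vector X^(0)
    // toRight[i] carries task i's value to task i+1;
    // toLeft[i] carries task i+1's value back to task i.
    toRight := make([]chan float64, n-1)
    toLeft := make([]chan float64, n-1)
    for i := range toRight {
        toRight[i] = make(chan float64, 1) // buffered: sends never block
        toLeft[i] = make(chan float64, 1)
    }
    results := make([]chan float64, n)
    for i := range results {
        results[i] = make(chan float64, 1)
    }
    go boundaryTask(x[0], toRight[0], toLeft[0], results[0])
    go boundaryTask(x[n-1], toLeft[n-2], toRight[n-2], results[n-1])
    for i := 1; i < n-1; i++ {
        go interiorTask(x[i], toLeft[i-1], toRight[i],
            toRight[i-1], toLeft[i], results[i])
    }
    for i := 0; i < n; i++ {
        fmt.Printf("X[%d] = %.4f\n", i, <-results[i])
    }
}
\end{verbatim}

Note that the receive operations provide the only synchronization: no task can compute its step $t+1$ value until its neighbors have supplied their step $t$ values, so execution is deterministic.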
Figure 1.12: Task structures for computing pairwise interactions for N=5. (a) The unidirectional ring used in the simple, nonsymmetric algorithm. (b) The unidirectional ring with additional channels used to return accumulated values in the symmetric algorithm; the path taken by the accumulator used for task 0 is shown as a solid line.
Our second example uses a similar channel structure but requires a more complex communication algorithm. Many problems require the computation of all $N(N-1)$ pairwise interactions $I(x_i,x_j)$, $i \neq j$, between $N$ data, $x_0, x_1, \ldots, x_{N-1}$. Interactions may be symmetric, in which case $I(x_i,x_j) = I(x_j,x_i)$ and only $N(N-1)/2$ interactions need be computed. For example, in molecular dynamics we may require the total force vector $f_i$ acting on each atom $x_i$, defined as follows:

$$f_i = \sum_{j=0,\ j \neq i}^{N-1} F(x_i, x_j).$$

Each atom is represented by its mass and Cartesian coordinates. $F(x_i,x_j)$ denotes the mutual attraction or repulsion between atoms $x_i$ and $x_j$; in this example, $F(x_i,x_j) = -F(x_j,x_i)$, so interactions are symmetric.
A simple parallel algorithm for the general pairwise interactions problem might create $N$ tasks. Task $i$ is given the datum $x_i$ and is responsible for computing the interactions $I(x_i,x_j)$ for all $j \neq i$. One might think that as each task needs a datum from every other task, $N(N-1)$ channels would be needed to perform the necessary communications. However, a more economical structure is possible that uses only $N$ channels. These channels are used to connect the $N$ tasks in a unidirectional ring (Figure 1.12(a)). Hence, each task has one inport and one outport. Each task first initializes both a buffer (with the value of its local datum) and an accumulator that will maintain the result of the computation. It then repeatedly

1. sends the value contained in its buffer on its outport,
2. receives a datum on its inport into its buffer,
3. computes the interaction between this datum and its local datum, and
4. uses the computed interaction to update its local accumulator.

This send-receive-compute cycle is repeated $N-1$ times, causing the $N$ data to flow around the ring; every task sees every datum and so can compute all $N-1$ interactions involving its own datum, at a cost of $N-1$ communications per task.
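A sketch of this ring algorithm in Go might look as follows. The function `interact` is a stand-in for whatever $I(x_i,x_j)$ happens to be, and the accumulator is simply a running sum; both are our assumptions for illustration.

\begin{verbatim}
package main

import "fmt"

// interact is a placeholder for the interaction I(x_i, x_j);
// here we simply multiply, and "accumulate" by summing.
func interact(xi, xj float64) float64 { return xi * xj }

// ringTask is task i: its buffer circulates around the ring while its
// accumulator collects the N-1 interactions involving its local datum.
func ringTask(local float64, n int, out chan<- float64,
    in <-chan float64, result chan<- float64) {
    buffer := local // initialize the buffer with the local datum
    acc := 0.0      // initialize the accumulator
    for step := 0; step < n-1; step++ {
        out <- buffer                  // 1. send the buffer on the outport
        buffer = <-in                  // 2. receive a datum into the buffer
        acc += interact(local, buffer) // 3-4. compute and accumulate
    }
    result <- acc
}

func main() {
    data := []float64{1, 2, 3, 4, 5} // N=5, as in Figure 1.12
    n := len(data)
    // ring[i] connects task i's outport to task (i+1) mod N's inport.
    ring := make([]chan float64, n)
    results := make([]chan float64, n)
    for i := range ring {
        ring[i] = make(chan float64, 1) // buffered: sends never block
        results[i] = make(chan float64, 1)
    }
    for i := 0; i < n; i++ {
        go ringTask(data[i], n, ring[i], ring[(i+n-1)%n], results[i])
    }
    for i := 0; i < n; i++ {
        fmt.Printf("task %d accumulated %v\n", i, <-results[i])
    }
}
\end{verbatim}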
It turns out that if interactions are symmetric, we can halve both the number of interactions computed and the number of communications by refining the communication structure. Assume for simplicity that $N$ is odd. An additional $N$ communication channels are created, linking each task to the task offset $(N-1)/2$ around the ring (Figure 1.12(b)). Each time an interaction $I(x_i,x_j)$ is computed between a local datum $x_i$ and an incoming datum $x_j$, this value is accumulated not only in the accumulator for $x_i$ but also in another accumulator that is circulated with $x_j$. After $(N-1)/2$ steps, the accumulators associated with the circulated values are returned to their home task using the new channels and combined with the local accumulators. Hence, each symmetric interaction is computed only once: either as $I(x_i,x_j)$ on the node that holds $x_i$ or as $I(x_j,x_i)$ on the node that holds $x_j$.
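The symmetric variant circulates an accumulator along with each datum. The sketch below, again ours rather than the book's, uses an antisymmetric placeholder force $F(x_i,x_j) = x_j - x_i$, assumes $N$ is odd, and models the additional channels of Figure 1.12(b) as the array `rets`.

\begin{verbatim}
package main

import "fmt"

// packet is a circulated datum together with its travelling accumulator.
type packet struct {
    home  int     // index of the task the datum belongs to
    value float64 // the circulated datum x_j
    acc   float64 // the accumulator circulated with x_j
}

// force is a placeholder antisymmetric interaction: F(x_i,x_j) = -F(x_j,x_i).
func force(xi, xj float64) float64 { return xj - xi }

// symTask circulates packets for (N-1)/2 steps, accumulating each computed
// interaction both locally and in the passing packet, then returns the
// packet it holds to that packet's home task over a return channel.
func symTask(i, n int, x float64, out chan<- packet, in <-chan packet,
    ret chan<- packet, retIn <-chan packet, result chan<- float64) {
    acc := 0.0
    p := packet{home: i, value: x}
    for step := 0; step < (n-1)/2; step++ {
        out <- p
        p = <-in
        f := force(x, p.value) // F(x_i, x_j)
        acc += f               // accumulate for the local datum
        p.acc -= f             // F(x_j, x_i) = -f, accumulated in transit
    }
    ret <- p                      // send the held accumulator home
    result <- acc + (<-retIn).acc // combine with our returned accumulator
}

func main() {
    data := []float64{1, 2, 3, 4, 5} // N must be odd
    n := len(data)
    h := (n - 1) / 2

    ring := make([]chan packet, n) // ring channels, Figure 1.12(a)
    rets := make([]chan packet, n) // return channels, Figure 1.12(b)
    results := make([]chan float64, n)
    for i := range ring {
        ring[i] = make(chan packet, 1)
        rets[i] = make(chan packet, 1)
        results[i] = make(chan float64, 1)
    }
    for i := 0; i < n; i++ {
        // After (N-1)/2 steps task i holds the packet that originated at
        // task (i-h) mod n, so its return outport leads there.
        go symTask(i, n, data[i], ring[i], ring[(i+n-1)%n],
            rets[(i+n-h)%n], rets[i], results[i])
    }
    for i := 0; i < n; i++ {
        fmt.Printf("f[%d] = %v\n", i, <-results[i]) // expect 15 - 5*x_i
    }
}
\end{verbatim}

With this placeholder force, $f_i = \sum_{j \neq i}(x_j - x_i)$, so the program should print 10, 5, 0, -5, -10: each pair is computed exactly once, yet both endpoints receive their contributions.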
The next example illustrates the dynamic creation of tasks and channels during program execution. Algorithm 1.1 explores a search tree looking for nodes that correspond to ``solutions.'' A parallel algorithm for this problem can be structured as follows. Initially, a single task is created for the root of the tree. A task evaluates its node and then, if that node is not a solution, creates a new task for each search call (subtree). A channel created for each new task is used to return to the new task's parent any solutions located in its subtree. Hence, new tasks and channels are created in a wavefront as the search progresses down the search tree (Figure 1.13).
Figure 1.13: Task structure for the search example. Each circle represents a node in the search tree and hence a call to the search procedure. A task is created for each node in the tree as it is explored. At any one time, some tasks are actively engaged in expanding the tree further (these are shaded in the figure); others have reached solution nodes and are terminating, or are waiting for their offspring to report back with solutions. The lines represent the channels used to return solutions.
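The wavefront of task creation is easy to mimic in Go, where spawning a goroutine per tree node and giving it a channel on which to report solutions mirrors the structure of Figure 1.13. The tree type and the solution test below are invented purely for illustration.

\begin{verbatim}
package main

import "fmt"

// node is a hypothetical search-tree node; leaves may be solutions.
type node struct {
    value    int
    children []node
}

// isSolution is an arbitrary stand-in test: even-valued leaves.
func isSolution(n node) bool { return len(n.children) == 0 && n.value%2 == 0 }

// search is one task: it evaluates its node and, if the node is not a
// solution, creates a new task and channel for each subtree, forwarding
// any solutions its offspring report back to its own parent.
func search(n node, solutions chan<- int) {
    defer close(solutions) // signal the parent that this subtree is done
    if isSolution(n) {
        solutions <- n.value
        return
    }
    kids := make([]chan int, len(n.children))
    for i, c := range n.children {
        kids[i] = make(chan int)
        go search(c, kids[i]) // dynamic task creation: one per subtree
    }
    for _, ch := range kids { // gather solutions from offspring
        for s := range ch {
            solutions <- s
        }
    }
}

func main() {
    root := node{1, []node{
        {2, nil},
        {3, []node{{4, nil}, {5, nil}}},
        {6, nil},
    }}
    results := make(chan int)
    go search(root, results)
    for s := range results {
        fmt.Println("solution:", s)
    }
}
\end{verbatim}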
In so-called embarrassingly parallel problems, a computation consists of a number of tasks that can execute more or less independently, without communication. These problems are usually easy to adapt for parallel execution. An example is a parameter study, in which the same computation must be performed using a range of different input parameters. The parameter values are read from an input file, and the results of the different computations are written to an output file.
Figure 1.14: Task structure for the parameter study problem. Workers (W) request parameters from the input task (I) and send results to the output task (O). Note the many-to-one connections: this program is nondeterministic in that the input and output tasks receive data from workers in whatever order the data are generated. Reply channels, represented as dashed lines, are used to communicate parameters from the input task to workers.
If the execution time per problem is constant and each processor has the same computational power, then it suffices to partition available problems into equal-sized sets and allocate one such set to each processor. In other situations, we may choose to use the task structure illustrated in Figure 1.14. The input and output tasks are responsible for reading and writing the input and output files, respectively. Each worker task (typically one per processor) repeatedly requests parameter values from the input task, computes using these values, and sends results to the output task. Because execution times vary, the input and output tasks cannot expect to receive messages from the various workers in any particular order. Instead, a many-to-one communication structure is used that allows them to receive messages from the various workers in arrival order.
The input task responds to a worker request by sending a parameter to that worker. Hence, a worker that has sent a request to the input task simply waits for the parameter to arrive on its reply channel. In some cases, efficiency can be improved by prefetching , that is, requesting the next parameter before it is needed. The worker can then perform computation while its request is being processed by the input task.
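A Go rendering of this structure might look as follows; the squaring stands in for the real computation, and the per-worker reply channel carried inside each request plays the role of the dashed reply channels in Figure 1.14. The names and the three-worker configuration are our assumptions.

\begin{verbatim}
package main

import (
    "fmt"
    "sync"
)

// request carries the requesting worker's reply channel to the input task.
type request struct {
    reply chan<- float64
}

// inputTask hands out one parameter per request; when the parameters are
// exhausted it closes each requester's reply channel to stop that worker.
func inputTask(params []float64, requests <-chan request) {
    for r := range requests {
        if len(params) == 0 {
            close(r.reply)
            continue
        }
        r.reply <- params[0]
        params = params[1:]
    }
}

// worker repeatedly requests a parameter, computes with it, and sends the
// result to the output task over the shared many-to-one results channel.
func worker(requests chan<- request, results chan<- float64) {
    reply := make(chan float64)
    for {
        requests <- request{reply}
        p, ok := <-reply
        if !ok {
            return // the input task has no more parameters
        }
        results <- p * p // stand-in for the real computation
    }
}

func main() {
    params := []float64{1, 2, 3, 4, 5, 6, 7, 8}
    requests := make(chan request) // many-to-one: workers -> input task
    results := make(chan float64)  // many-to-one: workers -> output task

    go inputTask(params, requests)

    var wg sync.WaitGroup
    for w := 0; w < 3; w++ { // e.g., one worker per processor
        wg.Add(1)
        go func() { defer wg.Done(); worker(requests, results) }()
    }
    go func() { wg.Wait(); close(requests); close(results) }()

    // The output task: results arrive in whatever order workers finish.
    for r := range results {
        fmt.Println("result:", r)
    }
}
\end{verbatim}

In this sketch, prefetching would amount to having a worker issue its next request before, rather than after, computing with the current parameter.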
Because this program uses many-to-one communication structures, the order in which computations are performed is not necessarily determined. However, this nondeterminism affects only the allocation of problems to workers and the ordering of results in the output file, not the actual results computed.
© Copyright 1995 by Ian Foster