The first step is to identify a chunk of code that is doing the big part of the computation...
In this elementary example it is apparent that most of the work is done by the for-loop. Therefore, the for-loop is the segment of code we should divide up for simultaneous work by separate processors.
The lower and upper limits, and/or the setpsize, specified for the for-loop can be made to be functions of each processor's ID number, myrank , and the total number of processors, numProcs .
A nice way of partitioning the data is to rewite the for-loop limits and stride as follows:
Serial code: for ( interval = 0; interval < numberOfIntervals; interval++ )
==>
Parallel code: for ( interval = myrank; interval < numberOfIntervals; interval += numProcs )
You may want to experiment with these limits and stride to verify that indeed all the intervals are distributed among the processors as evenly as possible and that no two processors are computing the same interval. This kind of clever use of myrank and numProcs is characteristic of data decomposition in parallel programming.
Anyway, before we go into more details, here's the parallel version of our program to compute PI, written in C:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <mpi.h>
const double pi = 3.1415926535897932385;
int main( int argc, char *argv[] ) {
/* This is a parallelization of serialPi.c. */
/* The for-loop is partitioned among the processors. */
/* Function MPI_Reduce sums the processors' results for processor 0. */
/* Processor 0 outputs. */
/* */
/* Remember, it is convenient to reference the typical processor in the */
/* first person, using "I" and "me" and "my." In other words, */
/* read through this program as if you were one of the processors. */
/* Arguments required for executing argv[0]: */
/* 0 -> serialPi */
/* 1 -> numberOfIntervals */
/* My ID number and the total number of processors: */
int myrank, numProcs;
/* Variables for the computation. */
int numberOfIntervals, interval;
double intervalWidth, intervalMidPoint, totalArea, myArea = 0.0;
/* For completeness I'll declare an MPI status variable, */
/* although I don't plan to use it. */
MPI_Status status;
/* Initialize the MPI API: */
MPI_Init(&argc, &argv);
/* Request my ID number: */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
/* I'll also ask how many other processors are out there: */
MPI_Comm_size(MPI_COMM_WORLD, &numProcs);
/* Okay. The preparations have been made. */
/* The number of intervals is a command line argument. */
numberOfIntervals = atoi( argv[1] );
/* Compute the interval width. */
intervalWidth = 1.0 / numberOfIntervals;
/* Now I'll compute my area. */
for ( interval = myrank; interval < numberOfIntervals; interval += numProcs ) {
intervalMidPoint = (interval + 0.5) * intervalWidth;
myArea += 4.0 / ( 1.0 + intervalMidPoint*intervalMidPoint );
}
/* Okay. Now I'll submit my results to the MPI API. */
/* If I'm processor 0, I'll receive the total area. */
/* Otherwise, I'm done. */
MPI_Reduce( &myArea,
&totalArea,
1,
MPI_DOUBLE,
MPI_SUM,
0,
MPI_COMM_WORLD );
/* If I'm processor 0, I need to normalize the total area and output. */
if (myrank == 0) {
totalArea *= intervalWidth;
printf( "The computed value of the integral is %.15f\n", totalArea );
printf( "The exact value of the integral is %.15f\n", pi );
}
/* Close the MIP API: */
MPI_Finalize();
return 0;
}
You should be able to notice that once the processors are done computing their portions, their results need to be combined somewhere and summed. All processors could send their results to processor 0 who could sum and print the result; This communication could be accomplished by coding MPI_Send and MPI_Recv functions for each processor.
Fortunately, the MPI API provides a cleaner way of coding using the MPI_Reduce () function, which in C is:
int MPI_Reduce( void* sendBuf,
void* recvBuf,
int count,
MPI_Datatype datatype,
MPI_Op operation,
int root,
MPI_Comm comm )
sendBuf is the send buffer or array packed with the data to be sent by each sending processor.
recvBuf is the receive buffer or array into which the MPI API will place the data transmitted and operated upon at the receving processor.
count is the length of both the recvBuf array and the sendBuf array.
datatype is the datatype of both the recvBuf and sendBuf arrays, typically MPI_INT, MPI_FLOAT , or MPI_DOUBLE .
operation is the operation (reduction) to be done by the MPI API on the sending processors' data before placing it in the receiving processor's recvBuf.
MPI_SUM is not the only operation you can use with MPI_Reduce. MPI provides a number of predefined reduction operations that can be used in this context, and you can define your own operations too. The predefined ones are:
MPI_MAX MPI_MIN MPI_SUM MPI_PROD MPI_LAND MPI_BAND MPI_LOR MPI_BOR MPI_LXOR MPI_BXOR MPI_MAXLOC MPI_MINLOC
root is the rank of the receiving processor.
comm is the communication domain, we will always specify the default domain MPI_COMM_WORLD which consists of all processors.
Note: If you did not copy example programs from HPC's account in the initial part, then copy the above source code as parallelPi.c to your ParallelPi directory.
Exercise: Do this!
To compile the parallelPi program, at your shell prompt (head node bhX) in the ParallelPi directory type
[agopu@bh2 agopu]$ cd ~/MPI_Tutorial/ParallelPi/ [agopu@bh2 ParallelPi]$ makeThe ls -l command will show your new executable parallelPi .
To run the program on interactive nodes you got through qsub -I (bcXX), do:
[agopu@bc81 agopu]$ cd ~/MPI_Tutorial/ParallelPi/ [agopu@bc81 ParallelPi]$ lamboot $PBS_NODEFILE [agopu@bc81 ParallelPi]$ mpirun C parallelPi 20000 [agopu@bc81 ParallelPi]$ lamhaltYou can expect to see something like this:
The computed value of the integral is 3.141592654423124 The exact value of the integral is 3.141592653589793 [0] Intel Trace Collector INFO: Writing tracefile parallelPi.stf in /N/u/agopu/MPI_Tutorial/ParallelPi
Non-interactive job submission: To submit a PBS job to run the above program, use the submit_parallelPi.sh shell script within your ParallelPi directory.
Note: If you did not copy examples programs from the HPC account then, modify the qsub job submission script so that mpirun will execute the parallelPi program.
. . .
mpirun C parallelPi 20000
. . .
Once you've done that, save the script as submit_parallelPi.sh and then do the job submission.
[agopu@bh2 ParallelPi]$ pwd /N/u/agopu/MPI_Tutorial/ParallelPi/ [agopu@bh2 ParallelPi]$ qsub submit_parallelPi.sh
Once again, you can monitor the progress of your job using qstat. This among other PBS related details are explained in the Working with Portable Batch Scheduler (PBS) section.
If you submitted the job to PBS non-intereactively, then check the output and error files for messages your program or PBS may have given out.
[agopu@bh2 ParallelPi]$ pwd /N/u/agopu/MPI_Tutorial/ParallelPi/ [agopu@bh2 ParallelPi]$ ls -latr submit_*.o* submit_*.e* -rw------- 1 agopu hpc 184 Dec 20 15:50 submit_parallel.o286288 [agopu@bh2 ParallelPi]$ less submit_parallel.o286288 The number of processors in this run is 4. The computed value of the integral is 3.141592654423124 The exact value of the integral is 3.141592653589793 [0] Intel Trace Collector INFO: Writing tracefile parallelPi.stf in /N/u/agopu/MPI_Tutorial/ParallelPi
| Previous: Computing PI using Serial... | Up: Table of Contents | Next: The 2-D Diffusion Equation |
|---|