Synchronization refers to the time order in which parallel threads access shared or global variables.
Because several parallel threads may need to access or update the same shared variable at about the same time, explicit control of synchronization is often required. A program with faulty synchronization may compile and run without incident, but will produce incorrect results.
In each of the two previous examples each parallel thread accessed a unique portion of the shared arrays x and y . The only synchronization issue was to make sure all parallel threads completed before the master proceeded to read the wall time. This particular form of synchronization, that all parallel threads in a parallel construct terminate before the master continues, is implicit in the OpenMP parallel-for directive. No additional programming is needed to effect loop termination synchronization.
This is a modification of our second OpenMP program "arrayUpdate_II". Again, it consists of updating the elements of an array with nested for-loops. However, we now also update a shared or global variable labeled globalCount from within the parallelized loops.
globalCount is intended to count the total number of times the sin() function is called.
In this example failing to explicitly synchronize reading and writing globalCount may produce erroneous output.
/* filename arrayUpdate_III.c */
/* OpenMP tutorial example */
/* WARNING: THIS PROGRAM MAY HAVE A SYNCHRONIZATION PROBLEM! */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <omp.h> /* Here's the header file for OpenMP. */
double walltime( double* ); /* the clock on the wall */
int main( int argc, char *argv[] )
{
/* Arguments required for executing argv[0]: */
/* 0 -> (executable) */
/* 1 -> arraySize */
double *x, *y; /* the arrays */
int arraySize; /* get from the command line */
int i;
/* Additional variables for arrayUpdate_II: */
int privIndx;
double privDbl;
/* Additional shared variable for arrayUpdate_III: */
int globalCount;
double startTime, elapsedTime; /* for timing */
double clockZero = 0.0;
arraySize = (int) atof( argv[1] );
/* Check the command line arguments. */
if (argc != 2)
{
fprintf( stdout, "usage: %s arraySize\n", argv[0] );
exit(-1);
}
/* Allocate memory for the arrays. */
x = (double *) malloc( (size_t) ( arraySize * sizeof(double) ) );
y = (double *) malloc( (size_t) ( arraySize * sizeof(double) ) );
/* Initialize x with some junk. */
for ( i = 0; i < arraySize; i++ )
{
x[i] = ( (double) i ) / ( i + 1000 );
}
/* Now begins the real work which we want to parallelize. */
/* Initialize the counter that resides in the global data space. */
globalCount = 0;
/* Mark the starting time. */
startTime = walltime( &clockZero );
/* Here's the OpenMP pragma that parallelizes the for-loop. */
/* This parallel construct has 3 private variables: */
/* 1) outer loop index "i" by default */
/* 2) inner loop index "privIndx" by explicit declaration */
/* 3) variable "privDbl" by explicit declaration */
#pragma omp parallel for private( privIndx, privDbl )
for ( i = 0; i < arraySize; i++ )
{
for ( privIndx = 0; privIndx < 16; privIndx++ )
{
privDbl = ( (double) privIndx ) / 16;
y[i] =
sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
/* globalCount totals the number of times sin() is called. */
/* Here, each thread must read globalCount, add 1 to the */
/* value, and write the new value back to globalCount. */
globalCount = globalCount + 1; /* same as globalCount++ */
}
}
/* Work's done. Get the elapsed wall time. */
elapsedTime = walltime( &startTime );
/* Print the global counter. */
fprintf( stdout, "\nthe global counter should equal %d\n", arraySize*16 );
fprintf( stdout, "globalCount = %d\n", globalCount );
/* Print the wall time and terminate. */
fprintf( stdout, "\nwall time = %.2fs for array size = %.0e\n",
elapsedTime, (double) arraySize );
fprintf( stdout, "\nProgram successfully terminated.\n\n\a" );
return(0);
}
This code makes a hidden assumption that is not generally valid. The code assumes that each thread will read globalCount , add one to the value, then write the updated value back to globalCount before any other parallel thread interferes.
Assume the value of globalCount is 1. The invalid assumption presumes the following scenario always occurs:
Although this is precisely what we want to happen, it may not actually occur. There is nothing in the code to prevent the following scenario:
Both threads read and write the same value for globalCount . As a result the program undercounts the number of times the sin() function has been called.
Let's run the program and see what happens. Copy all the above code to a file named "arrayUpdate_III.c" in your "OpenMPtutorial/ArrayUpdate/" directory.
Compile the parallel executable with:
make arrayUpdate_III
Remember that we have chosen to set the total number of parallel threads via the environment line of the "execution_script" to be submitted to LoadLeveler:
#@ environment = COPY_ALL; OMP_NUM_THREADS=8
Also in your "execution_script" make sure you're correctly identifying the parallel executable:
#@ initialdir = ~/OpenMPtutorial/ArrayUpdate
#@ executable = arrayUpdate_III
#@ arguments = 1E6
#@ output = arrayUpdate_III.out
Here let's run with an array size of 1,000,000 compared to 10,000,000 in previous runs.
Make sure "execution_script" is executable by typing chmod 744 execution_script , which also makes it readable by all.
To run the program, submit the "execution_script" to LoadLeveler. At your shell prompt type llsubmit execution_script .
Once you've submitted your job to LoadLeveler, you will want to view the job's status. Refer back to Compiling and running the parallel "arrayUpdate" program on how to track your job.
When your run is completed the standard output from the "arrayUpdate_III" program will be in a file named "arrayUpdate_III.out".
Type more arrayUpdate_III.out to view the standard output. It should look like this:
the global counter should equal 16000000 globalCount = 2143844 wall time = 21.99s for array size = 1e+06 Program successfully terminated.
We had 16,000,000 calls to the sin() function but counted only 2,143,844. Clearly a synchronization problem exists in the reading and writing of globalCount .
In the above program we have what is said to be a "race condition." Parallel threads race against each other without regard for correctness.
To help eliminate race conditions of this sort OpenMP provides programmers the critical directive. This directive gives each parallel thread exclusive access to a shared variable until its current update is complete:
#pragma omp critical
{
. . .
}
The code enclosed in the brackets can be executed by only one parallel thread at a time. When a thread is executing the bracketed code, another thread must wait until the former has moved beyond the bracketed code.
Modify "arrayUpdate_III.c" by enclosing the globalCount update by a critical directive:
#pragma omp critical
{
globalCount = globalCount + 1; /* same as globalCount++ */
}
Recompile and resubmit to LoadLeveler. When your run is completed type more arrayUpdate_III.out to view the standard output. It should look like this:
the global counter should equal 16000000 globalCount = 16000000 wall time = 104.23s for array size = 1e+06 Program successfully terminated.
The value of globalCount is now correct. However, because the critical directive forces the parallel threads to wait in line to access and update globalCount , the wall time is substantially increased -- by a factor of about 5. This unhappy situation exemplifies they key danger of OpenMP programming:
Excessive synchronized accessing of shared variables within parallel constructs will eliminate the benefits of parallelization, and may actually increase wall time over that of serial code.
The OpenMP programmer must be aware of the synchronization requirements of shared variable accesses, and must design code that minimizes the need for these synchronized accesses.
In the above program we encountered a race condition that produced erroneous output. We corrected the situation with a synchronization fix but which resulted in excessive runtime. Is there a way to reformulate the algorithm to run correctly but with an acceptable runtime?
In this elementary example such a reformulation is immediately apparent. We can declare a private variable that each thread increments. When the threads are finished with their loops, they synchronously update globalCount by adding to it their private variable values. The use of private variables obviates the need for synchronization. The number of synchronized accesses to globalCount is limited to 8 (for 8 parallel threads) versus 16,000,000 as in the preceding code.
For convenience the OpenMP API includes a clause that effects this operation. It is the reduction clause.
The term "reduction" refers to repeated updating of a variable via +, -, *, or a binary logical operator, with some other value. As will be apparent in the following, these binary operators must be commutative; division is not permitted as a reduction operator.
A reduction variable is a hybrid shared/private variable. It is initialized prior to the parallel construct like a shared or global variable. Within the parallel construct it behaves as a private variable. But at the conclusion of the parallel construct the threads' private values are amalgamated per the designated binary operation into the original shared variable.
In the above code we may treat globalCount as a reduction variable by coding the parallel construct as follows:
#pragma omp parallel for private( privIndx, privDbl ) \
reduction( + : globalCount )
for ( i = 0; i < arraySize; i++ )
{
for ( privIndx = 0; privIndx < 16; privIndx++ )
{
privDbl = ( (double) privIndx ) / 16;
y[i] =
sin( exp( cos( - exp( sin(x[i]) ) ) ) ) + cos( privDbl );
/* Here, each thread must read globalCount, add 1 to the */
/* value, and write the new value back to globalCount. */
globalCount = globalCount + 1; /* same as globalCount++ */
}
}
Here we have added the reduction clause to the parallel-for directive (using the "\" continuation character). The critical directive is no longer necessary for synchronization because the variable globalCount is treated privately during the execution of the loops. But at the conclusion of the parallel-for construct the threads' private values for globalCount are summed per the designated binary operation "+" into the original shared globalCount . The reduction clause is simply a shortcut for coding a reduction without having to explicitly code a private variable plus a final synchronized amalgamation.
Modify your "arrayUpdate_III.c" parallel-for construct to incorporate the reduction clause as shown above.
Recompile and resubmit to LoadLeveler. When your run is completed type more arrayUpdate_III.out to view the standard output. It should look like this:
the global counter should equal 16000000 globalCount = 16000000 wall time = 2.44s for array size = 1e+06 Program successfully terminated.
The results are correct, and the runtime is much improved because of the near elimination of accesses to a shared variable.