Debugging DVM-program performance. User's guide

Debugging DVM-program performance
User's guide
* 06.09.00 *

Contents

1 Introduction
2 Characteristics of program execution

2.1 Main characteristics of program execution
2.2 Components of the main characteristics
2.3 Program execution characteristics on each processor

3 Methodology of performance debugging

3.1 Representation of program as a hierarchy of intervals
3.2 Recommendations on characteristics analysis

4 Start of execution with statistics accumulation
5 Start of performance analyzer
6 Representation of characteristics
Appendix. The list of characteristics

1 Introduction

The performance of parallel program execution on multiprocessor computers with the distributed memory is determined by the following major factors:

program parallelism - a part of parallel calculations in total volume of calculations;
balance of processor load during parallel calculations;
time of interprocessor communications;
degree of overlapping of interprocessor communications and calculations.

Methods and tools of parallel program performance debugging essentially depend on the model used for parallel program.

An essential advantage of DVM-system is that at any moment it is known whether sequential or parallel part of the program is executed on any processor. Besides, all program points of synchronization operations are known. Therefore there is an opportunity to quantify the influence of four above factors on the program execution performance.

Special tools were developed for analysis and debugging of performance of DVM-program execution. They work as following. During a program execution on multiprocessor computer (or uniform computer network) the support system stores time characteristic information in processor memory and writes the data into a file upon the program completion. Then the file is processed on workstations in Windows 95/NT or UNIX environment using a special performance visualizer.

The performance visualizer allows the user to get time characteristics of the program execution in more or less detail.

2 Characteristics of program execution

2.1 Main characteristics of program execution

The opportunity to distinguish sequential and parallel parts of the program during its execution on the multiprocessor computer allows to predict a productive time required for the program execution on serial computer. So the main characteristic of parallel execution (efficiency coefficient) can be calculated: it is a ratio of the productive time to the total processor time. The total processor time is calculated as a product of execution time on the multiprocessor computer (execution time – maximum of program execution times on all processors used) by number of processors. The lost time is the total processor time of parallel execution subtracted by the productive time. If the programmer is not satisfied with the efficiency coefficient value he should analyze components of the lost time and their origin.

There are following components of the lost time:

Losses because of insufficient parallelism that cause replication of execution on several processors (insufficient parallelism); the losses are possible in two cases: first, sequential parts of the program are executed on all processors; second, some parallel loop iterations can be replicated on all or several processors according to user prescription;
Losses because of execution of interprocessor communication (communication);
Losses because of idle time of the processors on which the program execution has been completed earlier than on others (idle).

Time of interprocessor communications includes the time of data transfer from one processor to another and also includes a time lost because of message receive operation on one processor starting earlier than the corresponding send message operation on another one (dissynchronization losses). Since the DVM user does not deal with low-level operations, as message passing the dissynchronization information should be represented in form convenient to him.

During DVM-program execution interprocessor message exchanges are generated by the following collective operations:

global reduction (reduction operation with reduction variable local values, calculated on different processors);
shadow edges renewing;
loading remote access buffers;
data redistribution;
input/output operations (they are executed on one dedicated processor which receives data from and sends to others).

There are two modes of collective operation execution: synchronous and asynchronous. To execute a collective operation in asynchronous mode that provides a simultaneous execution of interprocessor communications and calculations, functions of start and waiting operation are served. If one of the operations listed above starts not simultaneously on different processors (in synchronous or asynchronous mode) then dissynchronization losses may occur. To estimate such losses the dissynchronization losses for each collective operation are calculated that is a time spent on synchronization by all processors as if any collective operation starts with processors synchronization. Overhead losses on synchronization message exchange are not taking into account.

A special characteristic –synchronization – is used to estimate total potential losses due to non-simultaneous start of collective operations on different processors.

The user should pay attention to the main origin of dissynchronization losses – processor loading imbalance. The loading imbalance is caused by non-uniform distribution of parallel loop calculations between processors.

If processor synchronization (interprocessor communication) would be performed upon each entering and exiting of parallel loop then processor loading imbalance would inevitably lead to dissynchronization losses. However as such synchronization is performed not for all loops then imbalances on different program segments can be compensated and real losses can be insignificant or even absent. To estimate possible imbalance losses the user is given a generalized characteristic- imbalance. To minimize overhead losses for calculation of this characteristic it is assumed that processor synchronization is performed only once – upon exiting the program. So a total load of each processor is calculated first and then imbalance losses due to dissynchronization are predicted. However in real program processor synchronization is performed not only upon the exiting but more often, so the real losses will be higher. The real dissynchronization losses will be still higher than imbalance value in case when processor load is strongly varies from one execution of parallel loop to another execution of the same loop.

Dissynchronization can occur not only due to imbalance but also because of different moments of collective operation completion caused by characteristics of its realization on a parallel computer. To evaluate the potential dissynchronization the user is provided with a special characteristic – time variation of collective operation completion. As the imbalance time this characteristic is an integral one. It allows user to quite accurately estimate possible losses due to dissynchronization in case when different execution time of collective operations are not random but are determined by network topology or processor specialization (input/ output processor, processor for reduction operations etc.).

An important characteristic showing potential reducing of communications by overlapping interprocessor exchanges and computations is time of overlapping.

The main characteristics of effectiveness are integral characteristics allowing user to estimate parallelization degree and potential of its increase. However to estimate the effectiveness of complex programs the integral characteristics can be not sufficient. In this case more detail information on execution of the whole program and it parts can be provided for user.

2.2 Components of the main characteristics

Some of the above main characteristics consist of several components and its values can be given to user.

Productive time consists of productive processor time, system processor time and input/output operation tine (not taking into account message exchange).

Insufficient parallelism loses consist of two components giving a possibility to distinguish losses in user program and corresponding system overheads.

Communication time includes time of dissynchronization loses and time used for start asynchronous collective operations, this time should be too small as compared with communication time, if communication libraries (MPI, PVM) work correctly.

To refine the communication time it is decomposed into following components:

time of reduction operation
time of renewing shadow edges;
time of loading remote access buffers;
time of data redistribution;
time of message exchange during input/output operations.

Real and potential dissynchronization losses and the losses due to variation in time of collective operation completion are decomposed in the same manner.

The time of exchanges and computations overlap is calculated for all asynchronous collective operations.

2.3 Program execution characteristics on each processor

The calculation of the main integral characteristics and its components are based on program execution characteristics on each processor. These characteristics can be useful for more detail analysis of parallel program execution effectiveness. Besides the values of these characteristics its average, maximal and minimal values and the corresponding processor are given.

3 Methodology of performance debugging

For effectiveness analysis of complex parallel program execution it is not sufficient to have characteristics of the whole program execution but detail characteristics of chosen program parts are needed. The execution of DVM-program can be represented as a hierarchy of intervals and the tools to do that and recommendations on characteristic analysis are described below.

3.1 Representation of program as a hierarchy of intervals

Program execution is considered as an interval of the highest level (zero level). This interval can include several intervals on the next (first) level. Such intervals can be parallel loops, sequential loops as well as any sequence of operations marked by user for which the execution starts from the first statement and completes with the last statement. The intervals of the first level can in turn include intervals of the second level etc.

All above characteristics are computed not only for the whole program but also for each its interval. Multiple interval execution can be considered as unrolled sequence of interval statements on the same processors as during real execution of the parallel program. In fact the characteristics of the interval executed several times are added up after each execution. The intervals included into the interval of higher level are identified by the source file name and a line number in it corresponding to the beginning of the interval and may be user defined integer number.

User controls program splitting into intervals during compilation. There are the following options:

-e1 – the intervals are all parallel loops and sequential loops embedding them;
-e2 – the intervals are all marked sequences of statements;
-e3 – concatenation of the first two options ( e1 and e2);
-e4 – the intervals are all parallel and sequential loops and marked sequences of statements.

To mark sequence of statements as an interval two special C-DVM or FORTRAN-DVM instructions are used:

In C-DVM the interval is defined as follows:

DVM(INTERVAL[integer expression ])<statement>,

and in FORTRAN-DVM:

CDVM$ INTERVAL[integer expression]
. . .
CDVM$ ENDINTERVAL

For example, marking loop body as an interval and prescribing integer expression as a loop counter each loop iteration will be represented as separate interval. In the same manner characteristics of even and odd loop iteration or characteristics of procedure execution with given parameters can be obtained.

3.2 Recommendations on characteristics analysis

While developing parallel program user as a rule has one of two possible target – solve the problem in acceptable time or create an efficient program for solving a class of problems on different parallel computers.

In the first case if the execution time is acceptable then other characteristics can be not interesting for the user. In the second case the main characteristic for user is coefficient of parallelism efficiency. If execution time or coefficient of parallelism efficiency does not satisfy the user then the lost time and its components should be analyzed.

Before proceeding to recommendation on analysis let us make some notes.

First, the calculation of lost time (as well as coefficient of parallelism efficiency) is not based on real time of execution on one processor but on predicted time. This predicted time may differ from the real one.

Real time may be greater than predicted one because the same calculations can be executed slower on one processor than on several processors. The explanation of that is: when the volume of data used in calculations changes then the speed of access to data through cache-memory changes too. Since modern processor performance depends on effectiveness of cache–memory usage, the real time can noticeably exceed the predicted time.

Real time may be less than predicted one because not all overhead losses of parallel program execution are taken into account in predicted time. Such losses (for example, losses for search in system tables) may occur when some frequently used functions are executed and it is impossible to calculate the time of their execution without introduction unacceptable perversions in program execution. These extra losses may be reduced in case of program execution on one processor.

As a result of influence of cache-memory usage efficiency and overhead system losses the user will get different values of productive time on different configurations of parallel computer. So it is desirable to execute program on one processor (when it is possible, as it may take much more memory than one processor has) to understand differences between real and predicted times.

Second, parallel DVM-program execution time may essentially differ from the time of a sequential program execution. It can be of following origins:

Access to distributed data in parallel program differs from that in sequential ones. The parallel program execution time may be increased by additional overhead losses by 10-30 percents. However in parallel program data access optimization can be done which results in speed up of parallel program execution as compared to sequential one.
Translation of DVM-programs into standard Fortran77 or C – programs may lead to differences in program optimizations by standard translators. As a result the parallel program may execute slower or faster. Characteristics of modern optimizing compiler influence execution performance very much (100 – 200 percents).
Some overhead losses due to parallel program execution support can considerably slow down the program execution (for example, allocation and free memory operations in sequential program may be transferred in very complicated construction of creating and deleting distributed arrays in parallel program).

Therefore it is desirable to execute program as sequential one on one processor (if it is impossible to do on parallel computer it may be possible on workstation).

If parallel execution time and sequential execution time are considerably different programmer can use the following DVM-system possibilities.

DVM-program can be compiled in a special mode such that it will not much differ from sequential program (however it is necessary to control the influence of these differences on execution time) but will contain tools for collecting time characteristics in different intervals. User can get sequential execution characteristics and compare them with corresponding characteristics of parallel execution on one processor.

User should take into account the facts mentioned above when analyzing the lost time and its components.

At first three lost time components for zero interval (the whole program) should be estimated. Probably main part of the lost time is one of two first components (insufficient parallelism or communications).

If the main losses are due to insufficient parallelism user should find out whether it appears in parallel or sequential parts. In case of parallel parts wrong definition of processor matrix or wrong data or calculation distributions may have an effect on the lost time. If insufficient parallelism was found in sequential parts a sequential loop executing a great volume of calculations may be the cause. But removing such causes may take a lot of efforts.

If the main losses are due to communications user should pay attention to dissynchronization losses. If these losses are substantial it is necessary to consider imbalance characteristic, as just imbalance of parallel loop calculations is the main cause of dissynchronization and great communication losses. If imbalance value is much less than synchronization value user should pay attention to time variation for collective operations. If dissynchronization is not a consequence of time variation of completion of collective operations it may be caused by imbalance of some parallel loops which in the considered program execution interval may be mutually compensated. So it makes sense to consider imbalance characteristics in intervals of lower levels.

The second probable cause of great dissynchronization losses may be processor dissynchronization that can occur even if input/output operations start simultaneously. This happens because the main job (operation system input/output function calls) is executed on input/output processor while the rest of processors are waiting for data from I/O processor or information about collective operation completion. This cause can be easily revealed if user considers the corresponding communication component – losses because of input/output communications.

Delay in asynchronous collective operation start may cause great communication losses. In this case user should refer to person responsible for maintenance of communication library, used by DVM-system.

A large number of reduction operations or operations loading data from other processors (renewing shadow edges or remote access) may be a main cause of communication losses. In this case user can reorganize the program to unite reduction operations or renewing shadow edges operations into group operations.

There is another approach for characteristic analysis when first, efficiency coefficients and lost time in first level intervals are analyzed and then they are analyzed in second level intervals etc. As a result a critical interval will be found and user will be able to concentrate his efforts on its characteristic analysis. It is necessary to take into considerations that interval dissynchronization losses and interval idle losses may be caused by not only imbalance and time variation on this very interval but by imbalance and time variation on other previous intervals too.

While debugging program performance the user does not need to perform total volume of calculations, as it will be when the program is used for real tasks. For example, the user can limit the number of regularly repeated external iterations to one or two. The efficiency coefficient depending on losses in intervals that are executed before the first iteration or after the last iteration may be considerably reduced. However the user can define the external iteration execution as a separate interval and then debug its performance as a performance of the whole program according to above methods.

4 Start of execution with statistics accumulation

When DVM-program is started on multiprocessor computer or on workstation network to accumulate statistics on DVM-program performance a parameter Is_DVM_STAT should be equal to 1.

After the program completion a file with name ‘sts’ is created. The length of the file is product of statistics buffer size and the number of processors used for the program execution.

Changing parameters it is possible to change the length of statistics buffer (StatBufLength), where each interval execution characteristics and maximal interval nesting level (MaxIntervalLevel) are saved. Reducing interval nesting level the user can reduce the number of intervals for which statistics are collected, and so the volume of statistics will be reduced.

If parameter IsTimeVariation is equal to 1, then statistics buffer is also used for saving information about times of start and completion of all collective operations. These times are used by performance visualizer to calculate potential dissynchronization and time variation losses and also to find out potential reducing of communications due to overlapping interprocessor exchanges and computations. If there is no enough buffer space to save information of all executed collective operations a warning message is output. User should take into account that performance visualizer cannot use full information while calculating above characteristics.

If errors were detected in the process of data gathering the file may be created in any case, and error message will be output into file or on screen.

List of messages:

Statistics: not enough memory for interval, data were not wrote to the file,

Statistics: number of ends of interval > number of begins of interval, data were not wrote to the file,

Statistics: end of interval nline = <N>, name = <name>, no end nline = <N> name =<name>, data were not wrote to the file,

Statistics: StatBufLength=<length>, increase buffer's size by <N> bytes, data were not wrote to the file,

Statistics: StatBufLength=<length>, not enough memory for times of collective operations, increase buffer's size by <N> bytes, only part of times of collective operations and all intervals were wrote to the file.

Statistics warning: used return or goto, times may be incorrect

5 Start of performance analyzer

To get time characteristics for intervals user should execute the following command:

dvm pa sts <file name> [[[<ch1> <ch2> <ch3>] <level>] <numbers>]

– output file name,
– y/n output of general characteristics,
– y/n output of comparative characteristics,
– y/n output characteristics for processors,
– nesting level number,
– list of processor numbers, for which characteristics should be output.

To get more information of the command parameters user can execute

dvm pa –h

6 Representation of characteristics

All characteristics are written into text file which name is defined by the user in the command string of performance analyzer. For each interval the following information is saved:

name of source file with DVM-program and first interval statement number (SOURCE, LINE)
interval type – whole program, parallel loop (PAR), sequential loop (SEQ) or marked by user sequence of statements (USER);
interval level number (LEVEL);
the number of entrances (and exits) in the interval (EXE_COUNT);
value of expression defined when describing interval (EXPR);
main execution characteristics and their components (Main characteristics);
minimal, maximal and average program execution characteristics on every processor (Comparative characteristics);
program execution characteristics on every processor (Execution characteristics on processors).

When characteristics are output their components are placed in the same line (to the right in brackets), or in the next line (to the right of symbols “*” or “-“).

Components of some characteristics connected with collective operation execution output as columns of table where lines correspond to the type of collective operation and columns are characteristics. One column (Nop) contains the number of operations of every type, that is characteristics not depending on the number of processor used for the program execution.

Information about minimal, maximal and average characteristics is saved in the table in the same way.

User can reduce the volume of output information prescribing needed types of characteristics. Besides it is possible to restrict the number of intervals prescribing the maximal interval level. User can also define the list of processor numbers for which execution characteristics will be saved. Some characteristics are not saved if their value is equal to zero.

Below there is an output example of Jacobi Fortran-DVM-program characteristics of execution on 4 workstations SGI O2. Size (L) of arrays A and B is equal to 1200, the number of iterations is equal to 4. Results (array B) are not written into a file.

Characteristics (Main characteristics and Comparative characteristics) are represented only for zero interval.

PROGRAM    JACOB
        PARAMETER    (L=1200,  ITMAX=4)
        REAL     A(L,L), EPS, MAXEPS, B(L,L)
CDVM$   PROCESSORS    P(2,2)

CDVM$	DISTRIBUTE   A  ( BLOCK,   BLOCK)   ONTO   P 
CDVM$	ALIGN   B( I, J )  WITH  A( I, J ) 

C	arrays A and B  with block distribution 

        PRINT *,  '**********  TEST_JACOBI   **********'

                  MAXEPS  =  0.5E - 7
CDVM$   PARALLEL    (J,I)   ON   A(I, J)
C	nest of two parallel loops, iteration (i,j) will be executed on 
C	processor, which is owner of element A(i,j) 
            DO  1   J  =  1, L
            DO  1   I  =  1, L
                A(I,  J)  =  0.
                IF(I.EQ.1 .OR. J.EQ.1 .OR. I.EQ.L .OR. J.EQ.L) THEN
                      B(I,  J) = 0.
                ELSE
                      B(I,  J)  = ( 1. + I + J )
                ENDIF
    1       CONTINUE
        DO  2   IT  =  1,  ITMAX
                  EPS  =  0.
CDVM$   PARALLEL  (J,  I)   ON  A(I,  J),  REDUCTION ( MAX( EPS ))
C	variable EPS is used for calculation of maximum value
                  DO  21  J  =  2, L-1
                  DO  21  I  =  2, L-1
                         EPS = MAX ( EPS,  ABS( B( I, J)  -  A( I, J)))
                         A(I, J)  =  B(I, J)
   21             CONTINUE
CDVM$   PARALLEL  (J,  I)   ON  B(I,  J),   SHADOW_RENEW   (A)
C	Copying shadow elements of array A from 
C	neighboring processors before loop execution
                  DO  22  J = 2,  L-1
                  DO  22  I = 2,  L-1
        B(I, J) =  (A( I-1, J ) + A( I, J-1 ) + A( I+1, J)+
     *                        A( I, J+1 )) / 4
   22             CONTINUE
                  PRINT *,  'IT = ', IT,  '   EPS = ', EPS
                  IF ( EPS . LT . MAXEPS )    GO TO   3
    2   CONTINUE
    3   CONTINUE
C        OPEN (3,  FILE='JACOBI.DAT',  FORM='FORMATTED')
C        WRITE (3,*)   B
C        CLOSE (3)
        END

INTERVAL ( NLINE=8 SOURCE=jac.fdv ) LEVEL=0 EXE_COUNT=1

--- Main characteristics ---

Parallelization efficiency	0.4952
Execution time	3.2723
Processors	4
Total time	13.0891
*Productive time	6.4814	( CPU= 6.1880 Sys= 0.2897 I/O= 0.0037 )
*Lost time	6.6077
-- Insufficient parallelism	1.0294	( User= 0.1491 Sys= 0.8803 )
-- Communication	3.2100	( Real_sync= 3.1443 Starts= 0.0267 )
-- Idle time	2.3683
Load imbalance	0.1131
Synchronization	6.1627
Time variation	3.0460
Overlap	0.0004

	Nop	Communic	Real_sync	Synchro	Variation	Overlap
I/O	5	0.0000	0.0000	3.0368	3.0358	0.0000
Reduction	4	3.0644	3.0413	3.0130	3.0130	0.0001
Shadow	4	0.1189	0.1030	0.1129	0.0054	0.0003

Note: there are only non-distributed data print statements in the program, this fact explains the absence of input/output communication losses. As such data have the same value on each processor such operations are executed by input/output processor without interprocessor exchanges.

--- Comparative characteristics ---

	T min	Npr	T max	Npr	T mid
Lost time	1.6245	1	1.6636	2	1.6519
* User insufficient par.	0.0324	2	0.0408	3	0.0373
* Sys insufficient par.	0.2061	4	0.2346	3	0.2201
* Idle time	0.0000	1	0.0409	2	0.5921
* Communication	0.0781	4	1.3733	1	0.8025
Real synchronization	0.8025	4	1.3563	1	0.7861
Synchronization	0.0715	4	2.7026	1	1.5407
Variation	0.0104	4	1.3528	1	0.7615
Overlap	0.0000	2	0.0003	3	0.0001
Load imbalance	0.0000	1	0.0409	2	0.0283
Execution time	1.9373	4	3.2723	1	2.6802
User CPU time	1.5315	3	1.5768	1	1.5470
Sys CPU time	0.0678	4	0.0775	3	0.0724
I/O time	0.0001	1	0.0019	2	0.0009
Start operation	0.0036	4	0.0089	2	0.0067
Processors	4	1	4	4	4

	Communic	Real_sync	Synchro	Variation	Overlap
I/O Tmin I/O Tmax I/O Tmid	0.0000 1 0.0000 4 0.0000	0.0000 1 0.0000 4 0.0000	0.0058 4 1.3496 1 0.7592	0.0066 4 1.3494 1 0.7589	0.0000 1 0.0000 4 0.0000
Reduction Tmin Reduction Tmax Reduction Tmid	0.0303 4 1.2990 1 0.7661	0.0212 4 1.2920 1 0.7603	0.0231 4 1.2849 1 0.7532	0.0000 3 0.0025 4 0.0012	0.0000 2 0.0001 1 0.0000
Shadow Tmin Shadow Tmax Shadow Tmid	0.0029 3 0.0687 1 0.0297	0.0000 2 0.0643 1 0.0257	0.0000 3 0.0681 1 0.0282	0.0000 3 0.0022 2 0.0013	0.0000 1 0.0003 3 0.0001

Appendix. The list of characteristics

Main characteristics and their components

Efficiency coefficient (Parallelization efficiency) is ratio of productive time to total processor time.
Time of execution (Execution time).
The number of used processors (Processors).
Total processor time (Total time) is production of the time of execution (Execution_time) by the number of used processors (Processors).
Productive time (Productive time) is the sum of productive processor time (CPU), input/output time (I/O) and productive system time (Sys).
Lost time (Lost time).
Insufficient parallelism (Insufficient par) and its components.
Communications and all components (Communication).
Idle (Idle time).
Imbalance (Load Imbalance).
Potential synchronization losses and all components (Synchronization).
Potential time variation losses and all components (Time variation).
Overlapping time and all components (Overlap).

Characteristics of program execution on each processor

Lost time (Lost time) is the sum of insufficient parallelism losses (User Insufficient par), system insufficient parallelism losses (Sys Insufficient par), communications losses (Communication) and idle (Idle).
Insufficient parallelism losses (User insufficient par).
System insufficient parallelism losses (Sys insufficient par).
Time of losses because of the given processor idle (Idle time) is difference between maximal interval execution time (on any processor) and interval execution time on the given processor.
Total communication time (Communication).
Real time of losses because of dissynchronization (Real synchronization).
Potential time of losses because of dissynchronization (Synchronization).
Potential time of losses because of time variation (Variation).
Time of asynchronous operation overlapping (Overlap).
Losses because of load imbalance (Load Imbalance) is difference between maximal processor time (CPU + Sys) and the time on the given processor.
Time of interval execution (Execution time).
Productive processor time (User CPU time).
Productive system time (Sys CPU time).
Input/output time (I/O time).
Time of collective operation start (Start operation).
Number of processors used for interval (Processors).
Communication times for all types of collective operations (Reduction, Shadow, Remote access, Redistribution and I/O), besides times of their start.
Real dissynchronization losses for all types of collective operations.
Potential dissynchronization losses for all types of collective operations.
Potential time variation losses for all types of collective operations.
Time of overlapping for all collective operations (Overlap).