Time_rep.html                   22-Jul-1994  jms
Revised and reformatted to HTML 26-Oct-1994  jms
Add SAMUS results               11-Nov-1994  jms

RMS Parameters and Muon Level 1.5 Simulator Program Performance

  • Introduction
  • Discussion
  • Results for WAMONLY = .TRUE.
  • Results for WAMONLY = .FALSE.
  • J. Snow's Data
  • T. McMahon's Data
  • Plots
    Introduction

    Purpose:

    To study the effect of RMS parameter settings on the behavior of the muon level 1.5 simulator run in the context of the D0USER program. Specifically, to study the effects of varying the multibuffer counts for sequential, relative, and indexed files, and of varying the multiblock count, on the CPU time, direct and buffered I/O, page faults, and elapsed time of the program D0USER.

    Conditions:

    The program was run on the same sample of 12 events. The tests done by J. Snow were run on VAXstation 4000-60's with the WAMONLY flag set to .true. and on VAXstation 4000-90's with the WAMONLY flag set to .false. For the runs with WAMONLY set to .false., data were taken on several machines at different times. The data shown in the plots for a particular RMS configuration come from the run with the shortest elapsed time, thereby approximating as closely as possible conditions on a machine dedicated to this test. The tests done by T. McMahon were done on VAXstation 3100 models M38 and M76 with WAMONLY set to .true.

    Additionally, a set of runs was done with WAMONLY set to .false., with the multibuffer count left at the default and the multiblock count set to the default and to lower values. This investigation was prompted by the results obtained from the tests above. Further, the time per event for WAMONLY = .false. was determined by running the program on just one event for both the default and optimum multibuffer counts, and on 100 events for the optimum multibuffer setting.
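
    For reference, these RMS parameters are per-process settings that can be changed from DCL with the SET RMS_DEFAULT command and examined with SHOW RMS_DEFAULT. The fragment below is a minimal sketch of how one run might be configured and its counters recorded; it is illustrative only: the exact command procedures used for these tests are not reproduced here, the particular counts shown are example values, and the RUN D0USER line is a placeholder for however the program is actually started.

      $! Illustrative sketch only -- not the actual test procedure.
      $ SHOW RMS_DEFAULT                               ! record the starting defaults
      $ SET RMS_DEFAULT /INDEXED    /BUFFER_COUNT=3    ! multibuffer count, indexed files
      $ SET RMS_DEFAULT /RELATIVE   /BUFFER_COUNT=3    ! multibuffer count, relative files
      $ SET RMS_DEFAULT /SEQUENTIAL /BUFFER_COUNT=48   ! multibuffer count, sequential files
      $ SET RMS_DEFAULT /BLOCK_COUNT=16                ! multiblock count
      $!
      $! Snapshot the process counters before and after the run.
      $! CPUTIM is reported in 10 ms ticks, matching the CPU column of the tables below.
      $ CPU0 = F$GETJPI("", "CPUTIM")
      $ DIO0 = F$GETJPI("", "DIRIO")
      $ BIO0 = F$GETJPI("", "BUFIO")
      $ PGF0 = F$GETJPI("", "PAGEFLTS")
      $ RUN D0USER                                     ! placeholder for the actual invocation
      $ WRITE SYS$OUTPUT "CPU (10 ms ticks): ", F$GETJPI("", "CPUTIM")   - CPU0
      $ WRITE SYS$OUTPUT "Direct I/O:        ", F$GETJPI("", "DIRIO")    - DIO0
      $ WRITE SYS$OUTPUT "Buffered I/O:      ", F$GETJPI("", "BUFIO")    - BIO0
      $ WRITE SYS$OUTPUT "Page faults:       ", F$GETJPI("", "PAGEFLTS") - PGF0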

    Discussion

    The results presented below are based on J. Snow's data. Plots of these data are available. T. McMahon's data qualitatively agree with the results presented below.

    Results:

    WAMONLY = .TRUE.

    1. Changing multibuffer counts for indexed and relative files has little to no effect on CPU time, direct or buffered I/O, page faults, or elapsed time. This is presumably because no significant I/O to indexed or relative files is done by D0USER.
    2. Changing multibuffer counts for sequential files has a significant effect on direct I/O calls, no effect on buffered I/O calls, little effect on CPU time, a significant effect on page faulting, and a measurable effect on elapsed time.
      1. As the buffer count for sequential files increases, the number of direct I/O calls drops somewhat less rapidly than linearly until it levels off at a buffer count of 64. Further increases in the buffer count yield no further decrease in direct I/O calls. At a buffer count somewhere between 100 and 128 the process virtual address space is exhausted and the program crashes. Direct I/O calls decrease by 80% as the buffer count increases from 3 to 64.
      2. As the buffer count for sequential files increases, CPU time rises slightly. There is a 10% rise in CPU time as the buffer count increases from 3 to 100.
      3. As the buffer count for sequential files increases, page faults rise linearly. Page faults increase by 110% as the buffer count increases from 3 to 100.
      4. Elapsed time could not be reliably compared as the buffer counts were changed, due to the activity of other processes on the machine. However, it was observed that elapsed time decreased monotonically by a total of 50% as the buffer count increased from 3 to 48. This is reasonable considering that with more buffers the program makes fewer I/O calls and therefore waits for resources less often. Working against this trend, however, is the increase in page faults, for which the program again must wait for resources. Increasing the buffer count from 3 to 64, while decreasing direct I/O calls by 80%, increases page faults by 60%. Elapsed time rises from its minimum at a buffer count of 48 to a plateau at a buffer count of 64, which is 62% of the value at a buffer count of 3.
    3. Changing multiblock counts has a significant effect on direct I/O calls and page faults. There was no significant change in CPU time as the multiblock count increased from 16 to its maximum of 127. Page faults increased by 20% over this range, while direct I/O calls decreased by two-thirds.
    4. Increasing the multiblock count while holding the multibuffer count constant causes a monotonic increase in elapsed time. Elapsed time increased 86% as the multiblock count increased from 16 to 127.
    5. Setting the multiblock count to 64 and the multibuffer count to 64 causes the program to fail with the RMS message "dynamic memory exhausted". The program will run with both counts set to 32. This setting results in direct I/O at a level just 10% of the value obtained with the system defaults (multiblock count = 16 and sequential multibuffer count = 3). Page faults are 50% higher than with the system defaults.
    6. Setting the multiblock count to twice the default value and increasing the multibuffer count shows the same qualitative behavior seen when the block count is at the system default and the buffer count is increased; however, elapsed time is smaller for the default value of block count at the same value of buffer count.
    7. Setting the multibuffer count to 32 and increasing the multiblock count shows the same qualitative behavior seen when the buffer count is at the system default and the block count is increased. However, elapsed time is smaller for the default value of block count at the same value of buffer count.

    WAMONLY = .FALSE.

    1. Changing multibuffer counts for indexed and relative files has little to no effect on CPU time, direct or buffered I/O, page faults, or elapsed time. This is presumably because no significant I/O to indexed or relative files is done by D0USER.
    2. Changing multibuffer counts for sequential files has a significant effect on direct I/O calls, CPU time, page faulting, and elapsed time, and no effect on buffered I/O calls.
      1. As the buffer count for sequential files increases, the number of direct I/O calls drops precipitously up to a buffer count of 48, after which it decreases only slowly. At a buffer count somewhere between 100 and 128 the process virtual address space is exhausted and the program crashes. Direct I/O calls decrease by 98% as the buffer count increases from 3 to 100.
      2. As the buffer count for sequential files increases, CPU time drops quickly, reaching a plateau at a buffer count of about 32. There is a 60% drop in CPU time as the buffer count increases from 3 to 100.
      3. As the buffer count for sequential files increases, page faults rise nearly linearly. Page faults increase by 60% as the buffer count increases from 3 to 100.
      4. Elapsed time dropped by 97% as the multibuffer count increased from the default of 3 to 100.
    3. Changing multiblock counts has a significant effect on elapsed time. There was little significant change in CPU time, page faults, or direct I/O counts as the multiblock count increased from 16 to its maximum of 127: direct I/O calls decreased by 20%, CPU time increased by 20%, and page faults held about steady.
      1. Increasing the multiblock count while holding the multibuffer count constant causes a monotonic increase in elapsed time. Elapsed time increased enormously, by over a factor of 5, as the multiblock count increased from 16 to 127.
      2. Decreasing the multiblock count below the default value, to as low as 3, results in increased CPU time (30%), decreased page faults (67%), greatly increased direct I/O operations (35%), and greatly increased elapsed time (86%). The multiblock count should be left at the default value.
    4. The start-up time for the program was found by running on just one event for both the default and the optimum multibuffer count. CPU time was nearly identical, but direct I/O was over 80% lower and page faults over 85% higher for the optimized buffer count compared to the default setting. Interestingly, elapsed time for program start-up was the same to within 2%.
    5. Time per event was calculated for the default case by subtracting the elapsed time for 1 event from that for 12 events and dividing by 11. For the optimum setting the time for 1 event was subtracted from the time for 100 events and the result divided by 99. For the default case the per-event time was 5.5 sec/event, while for the optimum setting (multibuffer count = 100) it was 0.60 sec/event, a decrease in execution time of over a factor of 9. The arithmetic is written out below.
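
    Written out, with T_N denoting the elapsed time of an N-event run (the individual T_N values are not tabulated here), the per-event times quoted above are

      t_default = (T_12  - T_1) / 11 = 5.5  sec/event     (multibuffer count = 3)
      t_optimum = (T_100 - T_1) / 99 = 0.60 sec/event     (multibuffer count = 100)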

    Conclusions:

    For WAMONLY = .true., increasing both the multiblock count and the sequential multibuffer count is effective in reducing direct I/O calls, thereby reducing resource waits. However, increasing these counts leaves less virtual address space and dynamic memory available to the process running the program, resulting in greater page faulting. These effects work against each other. Given that the rise in page faulting is linear over the range of allowable settings, and that increasing the multibuffer and multiblock counts is most effective at decreasing direct I/O calls at lower values of those counts (i.e., the slope of I/O calls vs. parameter count is most negative at small values of the parameters), I/O resource waits would be minimized at multibuffer and multiblock counts of around 32. However, it is observed that elapsed time is minimized at a multibuffer count of 48, and that increasing the multiblock count only increases elapsed time while using up needed virtual address space.

    For WAMONLY = .false., increasing or decreasing the multiblock count from the default value results in much longer program execution and should be avoided. However, increasing the multibuffer count is extremely effective in reducing direct I/O calls, with the consequence of greatly reducing program execution time. Increasing the multibuffer count to 100 increases the speed of execution of the program by a factor of somewhere between 9 and over 20!

    It would appear that a speed-up of over an order of magnitude in execution of the simulator with SAMUS enabled can be obtained with a simple DCL command, issued before program execution, that increases the sequential multibuffer count from the default value of 3 to 100. For WAMUS only, a factor of two increase in speed is all that can be expected.
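
    The DCL command in question is presumably the standard RMS default-setting command, e.g.

      $ SET RMS_DEFAULT /SEQUENTIAL /BUFFER_COUNT=100

    Issued without the /SYSTEM qualifier this affects only the current process; the setting reverts at logout, or can be restored explicitly with SET RMS_DEFAULT /SEQUENTIAL /BUFFER_COUNT=3.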

    McMahon's data support these points, although the tests were done on different machines. The tests done on the VAXstation 3100 Model M76 show a decrease of a factor of 2 in elapsed time when the buffer count is increased from 3 to only 16. This is conceivably due to the different process quotas in effect on that machine. Apparently process quotas and RMS parameters interact in some way to affect program performance.

    It is recommended that users of the simulator set the sequential multibuffer count to 100 before executing the program. This should result in a speed-up of over an order of magnitude in execution. Included below is a summary of the data from the timing tests, with pointers to the complete data. Also included is a display of the characteristics of the process that ran these tests on the 4000-60's. It can be seen that the virtual size of the process was maxed out at 60 MB by asking for too many buffers.

    J. Snow's Data

    The raw data for WAMONLY = .true. are summarized in Table 1. The raw data for WAMONLY = .false. are summarized in Table 2. Following the tables is a display of the process quotas and accounting for the tests.
    
    Table 1 WAMONLY = .TRUE.
    
    M.Block   M.Buffer   Elapsed     CPU    Direct   Buffered   Page
    Counts     Counts     (sec.)   (10ms.)    IO        IO      Faults
    
      16         3         223       4334    6285      250      23086
      16        16         193       4358    4361      239      27463
      16        32         142       4320    2546      239      30689
      16        48         114       4326    1542      241      36645
      16        64         142       4536    1197      239      38391
      16        82         141       4564    1196      241      42043
      16       100         142       4752    1204      249      48596
      16       128         virtual address space full
      32         3         317       4422    4728      232      24890
      64         3         375       4376    3330      229      26420
     127         3         415       4441    2113      228      28489
      32        32         171       4272     701      231      35496
     127       255         dynamic memory exhausted
      64       100         dynamic memory exhausted
      32       100         dynamic memory exhausted
    

    
    Table 2 WAMONLY = .FALSE.
    
    M.Block   M.Buffer   Elapsed     CPU    Direct   Buffered   Page
    Counts     Counts     (sec.)   (10ms.)    IO        IO      Faults   Node
    
      16         3         2324     10087   141092     469      27851    D0BR06
      16        16         1317      5745    60156     461      23481    D0CD03
      16        32          423      4176    16939     461      27267    D0CD03
      16        48          259      3981     6036     461      34241    D0BR06
      16        64          142      3825     4211     461      36864    D0CD03
      16        82          167      4064     2883     461      38253    D0BR06
      16       100           97      3923     1965     461      43212    D0CD03
      16       128         virtual address space full                    D0CD03
      32         3         4610      9092   129564     454      22199    D0CD03
      64         3         7652     10188   121780     451      31268    D0CD03
     127         3        12070     11843   114135     450      26597    D0CD03
     127       255         dynamic memory exhausted                      D0BR06
      64       100         dynamic memory exhausted                      D0BR06
      32       100         dynamic memory exhausted                      D0BR06
      16         3         4198      9261   141084     469      27318    D0UM05
       8         3         5664      9583   158536     479      22418    D0UM05
       3         3         7805     12242   190244     534      18063    D0UM05
    


    Process Quotas and Accounting

    Below is a display of the SHOW PROCESS command output for the process that ran the tests on the 4000-60's for WAMONLY = .true. Note that the peak virtual size is 60 MB, the maximum allowed.
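
    The listing was presumably produced with a command along the lines of

      $ SHOW PROCESS /ALL

    though the exact qualifiers used are not recorded.
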
    Terminal:           
    User Identifier:    [UPGRADE,SNOW]
    Base priority:      4
    Default file spec:  USR$ROOT4:[SNOW.TIMING]
    
    Process Quotas:
     Account name: UPGRADE 
     CPU limit:                      Infinite  Direct I/O limit:       200
     Buffered I/O byte count quota:    523184  Buffered I/O limit:     200
     Timer queue entry quota:            2047  Open file quota:       2028
     Paging file quota:                114287  Subprocess quota:       255
     Default page fault cluster:           16  AST quota:             2047
     Enqueue quota:                      2048  Shared file limit:        0
     Max detached processes:                0  Max active jobs:          0
    
    Accounting information:
     Buffered I/O count:      2128  Peak working set size:       8192
     Direct I/O count:        1943  Peak virtual size:         120000
     Page faults:            89817  Mounted volumes:                0
     Images activated:          26
     Elapsed CPU time:          0 00:01:12.50
     Connect time:              0 00:03:52.20
     
    Process privileges:
     TMPMBX               may create temporary mailbox
     NETMBX               may create network device
     
    Process rights:
     INTERACTIVE                       
     LOCAL                             
     D0CMS                             resource
     D0FAT_PRJ                         resource
     FATMEN_PRESTAGE                   resource
     D0FS_ACCESS                       resource
    
    Process Dynamic Memory Area  
        Current Size (bytes)        768000    Current Total Size (pages)    1500
        Free Space (bytes)          719592    Space in Use (bytes)         48408
        Size of Largest Block       719040    Size of Smallest Block           8
        Number of Free Blocks           20    Free Blocks LEQU 32 Bytes       13
    
    There are 2 processes in this job: 
    
      rhizomorphic$1
        SNOW_1 (*)
    

    T. McMahon's Data

           block  buffer  buffer  buffer      total      cpu      I/O calls  test
                  index.   rel.    seq.                                       #
    
    M38 workstation !!!!!
    
    wamus   16      3       3       3          320       124         6428    (7
    wamus   32      3       3       3          434       121         5072   (11
    wamus  126      3       3       3          611       125         2389   (12
    
    wamus   32      3      16      16          361       121         2474   (10
    wamus   32      3      16      16          354       120         2379   (16
     (the two above are the same conditions, done at different times)
    
    wamus   32     16       3       3          786       121         4971   (13
    wamus   32      3      16       3          748       121         4968   (14
    wamus   32      3       3      16          447       120         2379   (15
    
    all     16      3       3       3         4602       429       141609    (8
    all     32      3      16      16         1942       271        33397    (9    
    
    M76 workstation !!!!!
    
    wamus   32      3      16      16          252        75         2446   (17
    wamus   32      3      16       3          550        80         4970   (18
    wamus   32      3       3      16          257        79         2449   (19
    wamus   32      3       3       8          356        79         4006   (20
    wamus   64      3       3       8          277        76         2180   (21
    

    Plots

    Below is a list of hyperlinks to plots of Snow's data. The plots show page faults, direct I/O, CPU time, and elapsed time as functions of the multiblock and multibuffer counts for WAMONLY = .TRUE. and .FALSE.

  • WAMONLY = FALSE: Performance as a function of buffer count, block count fixed at default.
  • WAMONLY = FALSE: Performance as a function of block count, buffer count fixed at default.
  • WAMONLY = TRUE: Performance as a function of buffer count, block count fixed at default.
  • WAMONLY = TRUE: Performance as a function of block count, buffer count fixed at default.
  • WAMONLY = TRUE: Performance as a function of buffer count, block count fixed at 32.
  • WAMONLY = TRUE: Performance as a function of block count, buffer count fixed at 32.

  • Joel Snow
    26 October 1994