Time_rep.html 22-jul-1994 jms
Revised and reformatted to HTML 26 oct 1994 jms
Add SAMUS results 11 nov 1994 jms
RMS Parameters and Muon Level 1.5 Simulator Program Performance
Introduction
Purpose:
To study how RMS parameter settings affect the performance of the muon
level 1.5 simulator when it is run within the D0USER program.
Specifically, the study measures the effects of varying multibuffer
counts for sequential, relative, and indexed files, and of varying multiblock
counts, on the cpu time, direct and buffered I/O, and page faults of
D0USER.
Conditions:
The program was run on the same sample of 12 events.
The tests done by J. Snow were run on Vaxstation 4000-60's with
the WAMONLY flag set to .true. and on Vaxstation 4000-90's with the WAMONLY
flag set to .false. For the runs with WAMONLY set to .false., data were taken
on several machines at different times. The data shown in the plots for a
particular RMS configuration were taken from the run with the shortest elapsed
time, thereby approximating as closely as possible the conditions on a machine
dedicated to this test.
The tests done by T. McMahon were done on Vaxstation 3100 models M38 and
M76 with WAMONLY set to .true.
Additionally, a set of runs was done with WAMONLY set to .false., with the
multibuffer count left at the default and the multiblock count set to the default
and lower values. This investigation was prompted by the results obtained from
the tests above. Further, the time per event for WAMONLY = .false. was determined
by running the program on just one event for both the default and optimum
multibuffer counts, and on 100 events for the optimum multibuffer setting.
Discussion
The results presented below are based on J. Snow's data.
Plots of these data are available.
T. McMahon's data agree qualitatively with the results presented below.
Results:
WAMONLY = .TRUE.
- Changing multibuffer counts for indexed and relative files has little to no
effect on cpu time, direct or buffered I/O, page faults, or elapsed time.
This is presumably because D0USER does no significant I/O to indexed or
relative files.
- Changing multibuffer counts for sequential files has a significant effect on
direct I/O calls, no effect on buffered I/O calls, little effect on cpu
time, a significant effect on page faulting, and a measurable effect upon
elapsed time.
- As the buffer count for sequential files increases, the
number of direct I/O calls drops somewhat less rapidly than linearly until it
levels off at a buffer count of 64. Further increasing the buffer count
results in no further decrease in direct I/O calls. At a buffer count somewhere
between 100 and 128, process virtual address space is exhausted and the program
crashes. Direct I/O calls decrease by 80% as the buffer count increases from 3
to 64.
- As the buffer count for sequential files increases, cpu
time rises slightly. There is a 10% rise in cpu time as
the buffer count increases from 3 to 100.
- As the buffer count for sequential files increases, page
faults rise linearly. Page faults increase by 110% as the buffer count
increases from 3 to 100.
- Elapsed time could not be reliably compared as the buffer counts were
changed, because of the activity of other processes on the machine.
However, it was observed that elapsed time decreased monotonically by a total of
50% as the buffer count increased from 3 to 48. This is reasonable considering
that with more buffers the program makes fewer I/O calls and therefore waits
for resources less often. Working against this trend, however, is the increase
in page faults, for which the program again must wait for resources. Increasing
the buffer count from 3 to 64, while decreasing direct I/O calls by 80%,
increases page faults by about 66% (Table 1). Elapsed time rises from the
minimum at a buffer count of 48 to a plateau at a buffer count of 64, where it
is about 64% of the value at a buffer count of 3.
- Changing multiblock counts has a significant effect on direct I/O calls
and page faults. There was no significant change in cpu time as the multiblock
count increased from 16 to its maximum of 127. Page faults increased
by 20% over this range, while direct I/O calls decreased by two-thirds.
- Increasing the multiblock count while holding the multibuffer count
constant causes a monotonic increase in elapsed time. Elapsed time increased
by 86% as the multiblock count increased from 16 to 127.
- Setting the multiblock count to 64 and the multibuffer count to 64 causes
the program to fail with the RMS message "dynamic memory exhausted". The program
will run with both counts set to 32. This setting results in direct I/O at a
level just 10% of the value using system defaults (which are multiblock count =
16 and sequential multibuffer count = 3). Page faults are 50% higher than
with the system defaults.
- Setting the multiblock count to twice the default value and increasing
the multibuffer count shows the same qualitative behavior seen when the block
count is at the system default and the buffer count is increased. However,
elapsed time is smaller for the default value of block count at the same value
of buffer count.
- Setting the multibuffer count to 32 and increasing
the multiblock count shows the same qualitative behavior seen when the buffer
count is at the system default and the block count is increased. However,
elapsed time is smaller for the default value of block count at the same value
of buffer count.
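The percentages quoted for the sequential multibuffer scan can be rechecked directly from the rows of Table 1 with multiblock count = 16. The following is a minimal sketch for that purpose; the names `TABLE1_BUFFER_SCAN` and `pct_change` are illustrative and not part of the original study.

```python
# Buffer-count scan copied from Table 1 (WAMONLY = .TRUE., multiblock count 16).
# buffer count -> (elapsed sec, cpu in 10 ms units, direct I/O, page faults)
TABLE1_BUFFER_SCAN = {
    3: (223, 4334, 6285, 23086),
    16: (193, 4358, 4361, 27463),
    32: (142, 4320, 2546, 30689),
    48: (114, 4326, 1542, 36645),
    64: (142, 4536, 1197, 38391),
    82: (141, 4564, 1196, 42043),
    100: (142, 4752, 1204, 48596),
}
ELAPSED, CPU, DIRECT_IO, PAGE_FAULTS = range(4)


def pct_change(column, start, end, table=TABLE1_BUFFER_SCAN):
    """Percent change of one column between two buffer counts."""
    old = table[start][column]
    new = table[end][column]
    return 100.0 * (new - old) / old


print(f"direct I/O,  3 -> 64 : {pct_change(DIRECT_IO, 3, 64):+.1f}%")
print(f"cpu time,    3 -> 100: {pct_change(CPU, 3, 100):+.1f}%")
print(f"page faults, 3 -> 100: {pct_change(PAGE_FAULTS, 3, 100):+.1f}%")
print(f"page faults, 3 -> 64 : {pct_change(PAGE_FAULTS, 3, 64):+.1f}%")
print(f"elapsed,     3 -> 48 : {pct_change(ELAPSED, 3, 48):+.1f}%")
```

Because the elapsed times were taken on a shared machine, the elapsed-time figure is only indicative; the I/O and page-fault counts are less sensitive to other activity.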
WAMONLY = .FALSE.
- Changing multibuffer counts for indexed and relative files has little to no
effect on cpu time, direct or buffered I/O, page faults, or elapsed time.
This is presumably because D0USER does no significant I/O to indexed or
relative files.
- Changing multibuffer counts for sequential files has a significant effect on
direct I/O calls, cpu time, page faulting, and elapsed time, and no effect on
buffered I/O calls.
- As the buffer count for sequential files increases, the
number of direct I/O calls drops precipitously up to a buffer count of 48,
after which it decreases only slowly. At a buffer count somewhere
between 100 and 128, process virtual address space is exhausted and the program
crashes. Direct I/O calls decrease by 98% as the buffer count increases from 3
to 100.
- As the buffer count for sequential files increases, cpu
time drops quickly, reaching a plateau at a buffer count of about 32.
There is a 60% drop in cpu time as the buffer count increases from 3 to 100.
- As the buffer count for sequential files increases, page
faults rise nearly linearly. Page faults increase by 60% as the buffer count
increases from 3 to 100.
- Elapsed time dropped by about 96% (from 2324 to 97 sec in Table 2) as the
multibuffer count increased from the default of 3 to 100.
- Changing multiblock counts has a significant effect on elapsed time.
There was little change in cpu time, page faults, and direct I/O
counts as the multiblock
count increased from 16 to its maximum of 127. Direct I/O calls decreased
by 20%, cpu time increased by 20%, and page faults held about steady.
- Increasing the multiblock count while holding the multibuffer count
constant causes a monotonic increase in elapsed time. Elapsed time increased
enormously, by over a factor of 5, when the multiblock
count increased from 16 to 127.
- Decreasing the multiblock count below the default value, to as low as 3,
results in increased cpu time (30%), decreased page faults (about 34%, to
roughly two-thirds of the value at the default count), increased direct I/O
operations (35%), and greatly increased elapsed time (86%). The multiblock
count should be left at the default value.
- The startup time for the program was found by running on just one event
for both the default and optimum multibuffer counts. CPU time was nearly
identical, but direct I/O was over 80% lower and page faults over 85% higher for
the optimized buffer count compared to the default setting. Interestingly,
elapsed time for program startup was the same within 2%.
- Time per event was calculated
for the default case by subtracting the elapsed time for 1 event from that for
12 events and dividing by 11. For the optimum setting the time for 1 event was
subtracted from the time for 100 events and the result divided by 99. For the
default case the per event time was 5.5 sec/event, while for the optimum setting
(multibuffer count = 100) the per event time was 0.60 sec/event, a decrease
in execution time by a factor of over 9.
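The time-per-event calculation described in the last item can be sketched as follows. The function name `time_per_event` and the one-event and 100-event elapsed times used to exercise it are hypothetical, since the report does not quote the one-event run times; only the 5.5 and 0.60 sec/event figures come from the text.

```python
def time_per_event(t_many, t_one, n_many):
    """Elapsed time per event with the one-event (startup) run subtracted."""
    if n_many < 2:
        raise ValueError("need a run with at least 2 events")
    return (t_many - t_one) / (n_many - 1)


# Hypothetical elapsed times in seconds, chosen only to exercise the formula.
t_startup = 40.0
t_100_events = t_startup + 99 * 0.60
per_event_optimum = time_per_event(t_100_events, t_startup, 100)

# Per-event times quoted in the text (sec/event).
DEFAULT_PER_EVENT = 5.5
OPTIMUM_PER_EVENT = 0.60
per_event_speedup = DEFAULT_PER_EVENT / OPTIMUM_PER_EVENT

print(f"per-event time (hypothetical inputs): {per_event_optimum:.2f} sec")
print(f"per-event speedup from quoted values: {per_event_speedup:.2f}")
```

Subtracting the one-event run removes the fixed startup cost, so the result reflects only the marginal cost of each additional event.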
Conclusions:
For WAMONLY = .true., increasing both the multiblock count and the sequential
multibuffer count is effective in reducing direct I/O calls, thereby reducing
resource waits. However, increasing these counts leaves less virtual address
space and dynamic memory available to the process running the program, resulting
in greater page faulting.
These effects work against each other. Given that the rise in page faulting is
linear over the range of allowable settings and that increasing the multibuffer
and multiblock counts is most effective at decreasing direct I/O calls at lower
values of those counts (i.e. the slope of I/O calls vs parameter count is most
negative at small values of the parameters), I/O resource waits would be
expected to be minimized at multibuffer and multiblock counts of around 32.
However, it is observed that elapsed time is minimized at a multibuffer count of
48, and that increasing the multiblock count only results in increased elapsed
time while using up needed virtual address space.
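The statement that the slope of direct I/O calls versus buffer count is most negative at small counts can be illustrated with finite differences over the Table 1 buffer scan (multiblock count = 16). This is only a sketch; `finite_difference_slopes` is an illustrative helper, not something used in the original analysis.

```python
# (sequential multibuffer count, direct I/O calls) from Table 1, multiblock = 16.
DIRECT_IO_VS_BUFFERS = [
    (3, 6285), (16, 4361), (32, 2546), (48, 1542),
    (64, 1197), (82, 1196), (100, 1204),
]


def finite_difference_slopes(points):
    """Slope between each pair of neighbouring (x, y) points."""
    return [
        ((x0, x1), (y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]


slopes = finite_difference_slopes(DIRECT_IO_VS_BUFFERS)
for (low, high), slope in slopes:
    print(f"buffers {low:3d} -> {high:3d}: {slope:8.1f} direct I/O calls per buffer")
```

The slope shrinks in magnitude at every step and is essentially zero above a buffer count of 64, which is the levelling-off described in the results.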
For WAMONLY = .false., either increasing or decreasing the multiblock count
from the default value results in much longer
program execution and should be avoided. However, increasing the multibuffer
count is extremely effective in reducing direct I/O calls, with the consequence
of greatly reducing program execution time. Increasing the multibuffer
count to 100 increases the speed of execution of the program by a factor of
somewhere between 9 and over 20!
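The upper end of that range can be recovered from the elapsed times in Table 2. The sketch below divides the default-setting elapsed time by the elapsed time at each buffer count; the variable names are illustrative, and the runs were taken on different nodes (D0BR06 and D0CD03), so the ratios are approximate.

```python
# Sequential multibuffer count -> elapsed seconds, from Table 2
# (WAMONLY = .FALSE., multiblock count = 16, 12 events).
ELAPSED_TABLE2 = {3: 2324, 16: 1317, 32: 423, 48: 259, 64: 142, 82: 167, 100: 97}

baseline = ELAPSED_TABLE2[3]
speedup = {n: baseline / t for n, t in ELAPSED_TABLE2.items()}
reduction_pct = {n: 100.0 * (1.0 - t / baseline) for n, t in ELAPSED_TABLE2.items()}

for n in sorted(ELAPSED_TABLE2):
    print(f"buffers {n:3d}: speedup x{speedup[n]:5.1f}, "
          f"elapsed time reduced by {reduction_pct[n]:5.1f}%")
```

These whole-run ratios include program startup, whereas the factor of about 9 comes from the per-event times with startup subtracted, so the two ends of the range measure slightly different things.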
It would appear that, with SAMUS enabled, a speedup in execution of the
simulator of over an order of magnitude can be obtained by a
simple DCL command before program execution to increase the sequential
multibuffer count from the default value of 3 to a value of 100. For WAMUS
only, a factor of two increase in speed is all that can be expected.
McMahon's data support
these points although the tests were done on different machines. The tests done
on the Vaxstation 3100 Model M76 show a decrease of a factor of 2 in elapsed
time when the buffer count is increased from 3 to only 16. This is conceivably
due to the different process quotas in effect on that machine. Apparently
process quotas and RMS parameters interact in some way to affect program
performance.
It is recommended that users of the simulator set the multibuffer count to
100 before executing the program. This should result in a speed up of over an
order of magnitude in execution.
Included below is a summary of the data of the timing tests, with pointers to
the complete data.
Also included is a display of the characteristics of the process that ran these tests
on the 4000-60's. It can be seen that the virtual size of the process was maxed
out at 60 Mb by asking for too many buffers.
J. Snow's Data
The raw data for WAMONLY = .true. are summarized in Table 1.
The raw data for WAMONLY = .false. are summarized in Table 2.
Following the tables is a display of the process quotas
and accounting for the tests.
Table 1 WAMONLY = .TRUE.
M.Block   M.Buffer  Elapsed   CPU       Direct    Buffered  Page
Counts    Counts    (sec.)    (10ms.)   IO        IO        Faults
16        3         223       4334      6285      250       23086
16        16        193       4358      4361      239       27463
16        32        142       4320      2546      239       30689
16        48        114       4326      1542      241       36645
16        64        142       4536      1197      239       38391
16        82        141       4564      1196      241       42043
16        100       142       4752      1204      249       48596
16        128       virtual address space full
32        3         317       4422      4728      232       24890
64        3         375       4376      3330      229       26420
127       3         415       4441      2113      228       28489
32        32        171       4272      701       231       35496
127       255       dynamic memory exhausted
64        100       dynamic memory exhausted
32        100       dynamic memory exhausted
Table 2 WAMONLY = .FALSE.
M.Block   M.Buffer  Elapsed   CPU       Direct    Buffered  Page
Counts    Counts    (sec.)    (10ms.)   IO        IO        Faults    Node
16        3         2324      10087     141092    469       27851     D0BR06
16        16        1317      5745      60156     461       23481     D0CD03
16        32        423       4176      16939     461       27267     D0CD03
16        48        259       3981      6036      461       34241     D0BR06
16        64        142       3825      4211      461       36864     D0CD03
16        82        167       4064      2883      461       38253     D0BR06
16        100       97        3923      1965      461       43212     D0CD03
16        128       virtual address space full                        D0CD03
32        3         4610      9092      129564    454       22199     D0CD03
64        3         7652      10188     121780    451       31268     D0CD03
127       3         12070     11843     114135    450       26597     D0CD03
127       255       dynamic memory exhausted                          D0BR06
64        100       dynamic memory exhausted                          D0BR06
32        100       dynamic memory exhausted                          D0BR06
16        3         4198      9261      141084    469       27318     D0UM05
8         3         5664      9583      158536    479       22418     D0UM05
3         3         7805      12242     190244    534       18063     D0UM05
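The last three rows of Table 2 (node D0UM05, multibuffer count 3) are the below-default multiblock runs. A small sketch like the one below, with illustrative names, recomputes the percent change of each column between the default multiblock count of 16 and the lowest value tried, 3.

```python
# Multiblock count -> (elapsed sec, cpu in 10 ms units, direct I/O,
# buffered I/O, page faults); Table 2 rows taken on node D0UM05.
LOW_BLOCK_ROWS = {
    16: (4198, 9261, 141084, 469, 27318),
    8: (5664, 9583, 158536, 479, 22418),
    3: (7805, 12242, 190244, 534, 18063),
}
COLUMNS = ("elapsed", "cpu", "direct_io", "buffered_io", "page_faults")


def changes_vs_default(rows, default=16, target=3):
    """Percent change of every column going from `default` to `target`."""
    return {
        name: 100.0 * (new - old) / old
        for name, old, new in zip(COLUMNS, rows[default], rows[target])
    }


changes = changes_vs_default(LOW_BLOCK_ROWS)
for name in COLUMNS:
    print(f"{name:12s} {changes[name]:+6.1f}%")
```

Only page faults go down as the multiblock count is lowered; every other quantity gets worse, which is the basis for leaving the multiblock count at its default.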
Below is a display of the Show Process command output for the process running
the tests on the 4000-60's for WAMONLY = .true. Note that peak virtual size is
60Mb, the maximum allowed.
Terminal:
User Identifier: [UPGRADE,SNOW]
Base priority: 4
Default file spec: USR$ROOT4:[SNOW.TIMING]
Process Quotas:
Account name: UPGRADE
CPU limit: Infinite Direct I/O limit: 200
Buffered I/O byte count quota: 523184 Buffered I/O limit: 200
Timer queue entry quota: 2047 Open file quota: 2028
Paging file quota: 114287 Subprocess quota: 255
Default page fault cluster: 16 AST quota: 2047
Enqueue quota: 2048 Shared file limit: 0
Max detached processes: 0 Max active jobs: 0
Accounting information:
Buffered I/O count: 2128 Peak working set size: 8192
Direct I/O count: 1943 Peak virtual size: 120000
Page faults: 89817 Mounted volumes: 0
Images activated: 26
Elapsed CPU time: 0 00:01:12.50
Connect time: 0 00:03:52.20
Process privileges:
TMPMBX may create temporary mailbox
NETMBX may create network device
Process rights:
INTERACTIVE
LOCAL
D0CMS resource
D0FAT_PRJ resource
FATMEN_PRESTAGE resource
D0FS_ACCESS resource
Process Dynamic Memory Area
Current Size (bytes) 768000 Current Total Size (pages) 1500
Free Space (bytes) 719592 Space in Use (bytes) 48408
Size of Largest Block 719040 Size of Smallest Block 8
Number of Free Blocks 20 Free Blocks LEQU 32 Bytes 13
There are 2 processes in this job:
rhizomorphic$1
SNOW_1 (*)
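The 60 Mb figure can be cross-checked against the display above, assuming the peak virtual size is reported in 512-byte VAX pages. That unit is an assumption here, although it is consistent with the dynamic memory area, where 1500 pages are listed alongside 768000 bytes. A minimal sketch of the arithmetic:

```python
VAX_PAGE_BYTES = 512  # assumption: sizes are reported in 512-byte VAX pages

# Values copied from the Show Process display.
peak_virtual_pages = 120000
dyn_total_pages = 1500
dyn_current_bytes = 768000
dyn_free_bytes = 719592
dyn_in_use_bytes = 48408

peak_virtual_bytes = peak_virtual_pages * VAX_PAGE_BYTES
print(f"peak virtual size: {peak_virtual_bytes} bytes "
      f"= {peak_virtual_bytes / 1e6:.2f} x 10^6 bytes "
      f"= {peak_virtual_bytes / 2**20:.2f} x 2^20 bytes")

# Consistency checks on the dynamic memory area figures.
print(dyn_total_pages * VAX_PAGE_BYTES == dyn_current_bytes)
print(dyn_free_bytes + dyn_in_use_bytes == dyn_current_bytes)
```

Under that assumption the peak virtual size is about 61 million bytes, in line with the roughly 60 Mb limit mentioned in the text.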
T. McMahon's Data
        block   buffer  buffer  buffer  total   cpu     I/O calls  test
                index.  rel.    seq.                               #
M38 workstation !!!!!
wamus   16      3       3       3       320     124     6428       (7
wamus   32      3       3       3       434     121     5072       (11
wamus   126     3       3       3       611     125     2389       (12
wamus   32      3       16      16      361     121     2474       (10
wamus   32      3       16      16      354     120     2379       (16
(the two above are same conditions, done at different times)
wamus   32      16      3       3       786     121     4971       (13
wamus   32      3       16      3       748     121     4968       (14
wamus   32      3       3       16      447     120     2379       (15
all     16      3       3       3       4602    429     141609     (8
all     32      3       16      16      1942    271     33397      (9
M76 workstation !!!!!
wamus   32      3       16      16      252     75      2446       (17
wamus   32      3       16      3       550     80      4970       (18
wamus   32      3       3       16      257     79      2449       (19
wamus   32      3       3       8       356     79      4006       (20
wamus   64      3       3       8       277     76      2180       (21
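The factor-of-two improvement cited for the Model M76 can be read off tests 18 and 19 above, which differ only in the sequential buffer count (3 versus 16). The sketch below assumes the "total" column is elapsed time, which the table does not state explicitly; the dictionary layout and names are illustrative.

```python
# M76 rows, multiblock count 32, indexed buffer count 3.
# Assumption: "total" is elapsed time (units not given in the table).
M76_TESTS = {
    18: {"rel": 16, "seq": 3, "total": 550, "io_calls": 4970},
    19: {"rel": 3, "seq": 16, "total": 257, "io_calls": 2449},
}

elapsed_ratio = M76_TESTS[18]["total"] / M76_TESTS[19]["total"]
io_ratio = M76_TESTS[18]["io_calls"] / M76_TESTS[19]["io_calls"]

print(f"total time ratio, seq buffers 3 vs 16: {elapsed_ratio:.2f}")
print(f"I/O call ratio,   seq buffers 3 vs 16: {io_ratio:.2f}")
```

Both ratios come out close to 2, and test 18 shows that raising only the relative-file buffer count does not produce the gain.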
Plots
Below is a list of hyperlinks to plots of Snow's data. The plots show page
faults, direct IO, CPU time, and elapsed time as functions of multiblock and
multibuffer counts for WAMONLY true and false.
- WAMONLY = FALSE: Performance as a function of buffer count, block count fixed at default.
- WAMONLY = FALSE: Performance as a function of block count, buffer count fixed at default.
- WAMONLY = TRUE: Performance as a function of buffer count, block count fixed at default.
- WAMONLY = TRUE: Performance as a function of block count, buffer count fixed at default.
- WAMONLY = TRUE: Performance as a function of buffer count, block count fixed at 32.
- WAMONLY = TRUE: Performance as a function of block count, buffer count fixed at 32.
Joel Snow
26 October 1994