Timing for serial
[crnc25@login1.ham8 cw2]$ cat slurm-15759203.out
This should be run on Hamilton!
Command being timed: "./serial"
User time (seconds): 128.96
System time (seconds): 0.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:09.87
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 17928
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 32904
Voluntary context switches: 12
Involuntary context switches: 331
Swaps: 0
File system inputs: 80
File system outputs: 48
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[crnc25@login1.ham8 cw2]$ tail -f slurm-15759296.out
This should be run on Hamilton!
Command being timed: "./serial"
User time (seconds): 130.77
System time (seconds): 0.01
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:11.72
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 18104
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 1
Minor (reclaiming a frame) page faults: 32909
Voluntary context switches: 11
Involuntary context switches: 363
Swaps: 0
File system inputs: 80
File system outputs: 48
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status:
INIT
- Each grid point (i,j) is computed independently
- No data dependencies between iterations
- ensures each thread has its own copy of the index variable
- schedule(static) gives predictable, contiguous chunks → good cache locality
GRAD
- Reads from u, v arrays (shared, read-only)
- Writes to du, dv arrays (each thread writes to different locations)
- No write conflicts because du[idx] and dv[idx] are unique per (i,j)
- The stencil pattern (5-point Laplacian) only reads neighbours, so there are no race conditions
Key insight: This is the Jacobi iteration pattern — we read from the old state (u, v) and write to a separate array (du, dv). This naturally avoids data races.
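A minimal sketch of the INIT and GRAD loops described above, to make the read/write separation concrete. This is not the coursework code: the grid size N, the row-major layout idx = i*N + j, the periodic boundaries, the placeholder initial values and the function names init and grad are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical grid size; row-major layout, idx = i * N + j. */
#define N 64

/* INIT: every (i,j) is independent, so rows are split statically. */
void init(double *u, double *v)
{
    assert(u && v);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        /* j is declared inside the parallel loop, so each thread has its own copy */
        for (int j = 0; j < N; j++) {
            size_t idx = (size_t)i * N + j;
            u[idx] = (double)i;   /* placeholder initial condition (assumption) */
            v[idx] = (double)j;
        }
    }
}

/* GRAD: read only the old state u, v; write only du, dv (Jacobi-style). */
void grad(const double *u, const double *v, double *du, double *dv)
{
    assert(u && v && du && dv);
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            /* periodic neighbours (assumption; the real boundaries may differ) */
            int up = (i + N - 1) % N, down = (i + 1) % N;
            int left = (j + N - 1) % N, right = (j + 1) % N;
            size_t idx = (size_t)i * N + j;
            /* 5-point Laplacian: four neighbours minus four times the centre */
            du[idx] = u[up * N + j] + u[down * N + j]
                    + u[i * N + left] + u[i * N + right] - 4.0 * u[idx];
            dv[idx] = v[up * N + j] + v[down * N + j]
                    + v[i * N + left] + v[i * N + right] - 4.0 * v[idx];
        }
    }
}
```

Because each iteration writes only du[idx] and dv[idx] for its own (i,j), no synchronisation is needed inside the loop. Built without -fopenmp the pragmas are ignored and the code runs serially.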
STEP
- Each u[idx] and v[idx] is updated independently
- No cross-index dependencies
- Classic embarrassingly parallel pattern
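The STEP update can be sketched as a single flat loop. The explicit update u += dt * du, the time step dt, the element count n and the function name step are assumptions for illustration; the actual update rule is not shown in these notes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical explicit update: each index touches only its own u, v entries. */
void step(double *u, double *v, const double *du, const double *dv,
          size_t n, double dt)
{
    assert(u && v && du && dv);
    #pragma omp parallel for schedule(static)
    for (long idx = 0; idx < (long)n; idx++) {
        u[idx] += dt * du[idx];   /* no other iteration reads or writes u[idx] */
        v[idx] += dt * dv[idx];
    }
}
```

Using the same schedule(static) as the other loops means a thread tends to update the same contiguous block of indices it touched earlier, which is the cache-locality argument made for INIT.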
============================================================
OPENMP SCALING ANALYSIS
Threads   Time(s)   Speedup   Efficiency
      1   103.750     1.00x       100.0%
      2    55.150     1.88x        94.1%
      4    30.110     3.45x        86.1%
      8    13.150     7.89x        98.6%
     16     6.990    14.84x        92.8%
     32     4.410    23.53x        73.5%
     64     4.100    25.30x        39.5%
    128     5.750    18.04x        14.1%
ANALYSIS:
- At 64 threads: 25.3x speedup (39.5% efficiency)
- At 128 threads: 18.0x speedup (14.1% efficiency)
- WARNING: Significant efficiency drop at 64 threads
  This suggests cross-socket NUMA effects
- Consider using OMP_PROC_BIND=spread for better memory distribution
This is consistent with the thread and cache topology of a Hamilton node, which has two 64-core sockets, so any run with more than 64 threads must span both sockets:
CPU name: AMD EPYC 7702 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU stepping: 0
Hardware Thread Topology
Sockets: 2 Cores per socket: 64 Threads per core: 1
See Tobias Weinzierl, Principles of Parallel Scientific Computing, p. 171.
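One way to check where threads actually land is to print each thread's place number before timing anything. The sketch below is an assumption-laden helper, not part of the coursework: report_placement is a hypothetical name, and omp_get_place_num needs an OpenMP 4.5 or newer runtime (it returns -1 when the thread is not bound to a place).

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Prints one line per thread and returns the size of the thread team. */
int report_placement(void)
{
    int team = 1;
#ifdef _OPENMP
    #pragma omp parallel
    {
        /* serialise the output so lines from different threads do not interleave */
        #pragma omp critical
        printf("thread %d of %d on place %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num());

        /* one thread records the team size; single ends with an implicit barrier */
        #pragma omp single
        team = omp_get_num_threads();
    }
#else
    printf("compiled without OpenMP: 1 thread\n");
#endif
    assert(team >= 1);
    return team;
}
```

Calling this from main with, for example, OMP_NUM_THREADS=64 OMP_PLACES=cores OMP_PROC_BIND=spread set in the job script should show the threads distributed over distinct places; comparing against OMP_PROC_BIND=close indicates whether a 64-thread run stays on one socket.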
============================================================
OPENMP SCALING ANALYSIS
Threads   Time(s)   Speedup   Efficiency
      1   103.780     1.00x       100.0%
      2    50.910     2.04x       101.9%
      4    26.170     3.97x        99.1%
      8    13.640     7.61x        95.1%
     16    12.730     8.15x        51.0%
     32    10.630     9.76x        30.5%
     64     7.210    14.39x        22.5%
    128     8.290    12.52x         9.8%
ANALYSIS:
- At 64 threads: 14.4x speedup (22.5% efficiency)
- At 128 threads: 12.5x speedup (9.8% efficiency)
- WARNING: Significant efficiency drop at 64 threads
  This suggests cross-socket NUMA effects
- Consider using OMP_PROC_BIND=spread for better memory distribution
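The Speedup and Efficiency columns in the tables above follow from the usual definitions, speedup S(p) = T(1) / T(p) and efficiency E(p) = S(p) / p. A small sketch for recomputing them from the measured times; the function names are hypothetical and this is not the script that produced the tables.

```c
#include <assert.h>
#include <stdio.h>

/* Speedup relative to the single-thread time t1. */
double speedup(double t1, double tp)
{
    assert(tp > 0.0);
    return t1 / tp;
}

/* Parallel efficiency as a fraction of ideal linear speedup on p threads. */
double efficiency(double t1, double tp, int p)
{
    assert(p > 0);
    return speedup(t1, tp) / p;
}

/* Prints one table row: threads, time, speedup, efficiency in percent. */
void print_row(double t1, double tp, int p)
{
    printf("%7d %9.3f %8.2fx %10.1f%%\n",
           p, tp, speedup(t1, tp), 100.0 * efficiency(t1, tp, p));
}
```

For the first run, speedup(103.750, 4.100) is about 25.3 and efficiency(103.750, 4.100, 64) is about 0.395, matching the 64-thread row.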