Timing for serial

[crnc25@login1.ham8 cw2]$ cat slurm-15759203.out 
This should be run on Hamilton!
	Command being timed: "./serial"
	User time (seconds): 128.96
	System time (seconds): 0.01
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:09.87
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 17928
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 32904
	Voluntary context switches: 12
	Involuntary context switches: 331
	Swaps: 0
	File system inputs: 80
	File system outputs: 48
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

[crnc25@login1.ham8 cw2]$ tail -f slurm-15759296.out 
This should be run on Hamilton!
	Command being timed: "./serial"
	User time (seconds): 130.77
	System time (seconds): 0.01
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:11.72
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 18104
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 1
	Minor (reclaiming a frame) page faults: 32909
	Voluntary context switches: 11
	Involuntary context switches: 363
	Swaps: 0
	File system inputs: 80
	File system outputs: 48
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status:

INIT

Each grid point (i,j) is computed independently
No data dependencies between iterations
Making the loop indices private ensures each thread has its own copy of the index variable
schedule(static) gives predictable, contiguous chunks → good cache locality

GRAD

Reads from u, v arrays (shared, read-only)
Writes to du, dv arrays (each thread writes to different locations)
No write conflicts because du[idx] and dv[idx] are unique per (i,j)
The stencil pattern (5-point Laplacian) only reads neighbours — no race conditions

Key insight: This is the Jacobi iteration pattern — we read from the old state (u, v) and write to a separate array (du, dv). This naturally avoids data races.
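
A minimal sketch of the read-old, write-new pattern described above. It assumes a row-major grid of nx by ny points with idx = i * ny + j, and it only updates interior points. The function name laplacian_5pt, the grid spacing h, and the boundary treatment are illustrative assumptions, not taken from the coursework code, and only the Laplacian part of the gradient is shown.

```c
#include <stddef.h>

/* Hypothetical helper: 5-point Laplacian of u, written into du.
 * Assumes row-major storage (idx = i * ny + j) and leaves the
 * boundary entries of du untouched; both are assumptions. */
void laplacian_5pt(const double *u, double *du, int nx, int ny, double h)
{
    const double inv_h2 = 1.0 / (h * h);

    /* u is only read and every (i, j) writes its own du[idx], so the
     * iterations are independent. The pragma is simply ignored when
     * the file is compiled without OpenMP support. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (int i = 1; i < nx - 1; i++) {
        for (int j = 1; j < ny - 1; j++) {
            size_t idx = (size_t)i * ny + j;
            du[idx] = (u[idx - ny] + u[idx + ny]
                     + u[idx - 1] + u[idx + 1]
                     - 4.0 * u[idx]) * inv_h2;
        }
    }
}
```

Because i and j are declared inside the for statements, each thread gets its own copies without an explicit private clause.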

STEP

Each u[idx] and v[idx] is updated independently
No cross-index dependencies
Classic embarrassingly parallel pattern
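
A minimal sketch of such an update loop. The notes only state that each index is updated independently; the explicit Euler form u += dt * du, the function name euler_step, and the parameter dt are assumptions made for illustration.

```c
/* Hypothetical explicit-Euler update over n grid points.
 * The time-stepping scheme is an assumption; the point is that
 * iteration idx touches only u[idx] and v[idx]. */
void euler_step(double *u, double *v,
                const double *du, const double *dv,
                long n, double dt)
{
    /* No iteration reads or writes another iteration's data, so a
     * plain static schedule splits the range into contiguous chunks. */
    #pragma omp parallel for schedule(static)
    for (long idx = 0; idx < n; idx++) {
        u[idx] += dt * du[idx];
        v[idx] += dt * dv[idx];
    }
}
```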

============================================================
OPENMP SCALING ANALYSIS

Threads    Time(s)    Speedup    Efficiency
      1    103.750      1.00x        100.0%
      2     55.150      1.88x         94.1%
      4     30.110      3.45x         86.1%
      8     13.150      7.89x         98.6%
     16      6.990     14.84x         92.8%
     32      4.410     23.53x         73.5%
     64      4.100     25.30x         39.5%
    128      5.750     18.04x         14.1%
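
The Speedup and Efficiency columns above follow from the timings: speedup is T(1) / T(p) and efficiency is speedup / p, with the one-thread OpenMP run as the baseline T(1). A small sketch of that arithmetic follows; the function names are hypothetical and the print layout only approximates the table.

```c
#include <stdio.h>

/* Speedup relative to the single-thread run: S(p) = T(1) / T(p). */
double speedup(double t1, double tp)
{
    return t1 / tp;
}

/* Parallel efficiency as a fraction: E(p) = S(p) / p. */
double efficiency(double t1, double tp, int threads)
{
    return speedup(t1, tp) / threads;
}

/* Print one row: thread count, time, speedup, efficiency in percent. */
void print_row(int threads, double t1, double tp)
{
    printf("%7d %10.3f %9.2fx %12.1f%%\n", threads, tp,
           speedup(t1, tp), 100.0 * efficiency(t1, tp, threads));
}
```

For example, calling print_row(2, 103.750, 55.150) gives a speedup of about 1.88x and an efficiency of about 94.1%, matching the two-thread row.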

ANALYSIS:

At 64 threads: 25.3x speedup (39.5% efficiency)
At 128 threads: 18.0x speedup (14.1% efficiency)
WARNING: Significant efficiency drop at 64 threads
This suggests cross-socket NUMA effects
Consider using OMP_PROC_BIND=spread for better memory distribution

This is clear to see when we look at the thread and cache topology of Hamilton:

CPU name: AMD EPYC 7702 64-Core Processor
CPU type: AMD K17 (Zen2) architecture
CPU stepping: 0

Hardware Thread Topology

Sockets: 2
Cores per socket: 64
Threads per core: 1

Tobias Weinzierl, Principles of Parallel Scientific Computing, p. 171

============================================================
OPENMP SCALING ANALYSIS

Threads    Time(s)    Speedup    Efficiency
      1    103.780      1.00x        100.0%
      2     50.910      2.04x        101.9%
      4     26.170      3.97x         99.1%
      8     13.640      7.61x         95.1%
     16     12.730      8.15x         51.0%
     32     10.630      9.76x         30.5%
     64      7.210     14.39x         22.5%
    128      8.290     12.52x          9.8%

ANALYSIS:

At 64 threads: 14.4x speedup (22.5% efficiency)
At 128 threads: 12.5x speedup (9.8% efficiency)
WARNING: Significant efficiency drop at 64 threads
This suggests cross-socket NUMA effects
Consider using OMP_PROC_BIND=spread for better memory distribution
