Scientific Computing
Have you ever been interested in optimizing a piece of code and looked for solutions on the internet? I've read too many comments on StackOverflow bashing some poor fellow with answers like:
Premature optimization is the root of all evil
Use a library instead
But someone has to write the code for these libraries, and what if you're just interested in optimizing code for fun? Perhaps, like me, you find it interesting to try to squeeze every last percent of performance out of your hardware. Here is a collection of good resources that I wish I had known about when I was starting out with optimization and performance engineering.
Benchmarking and Performance Engineering
Profilers and Benchmarking libraries
OpenMP
- The OpenMP Common Core: Making OpenMP Simple Again - This book is an excellent introduction to OpenMP, but it also introduces some of the fundamental concepts of parallel programming in general (e.g. Amdahl's law and false sharing; a sketch of the latter follows this list).
- Using OpenMP - The Next Step - Discusses more advanced topics and goes deeper into hardware and performance. Personally, I find the affinity and tasking parts most interesting.
- High Performance Parallel Runtimes - Great if you want to understand more about how OpenMP and parallel runtimes work under the hood.
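As a taste of the kind of pitfalls these books cover, here is a minimal sketch of false sharing (the example, names, and the 64-byte cache-line assumption are mine): every thread updates its own counter, yet the counters are adjacent in memory and therefore share cache lines, so the updates still contend.

#include <omp.h>
#include <cstdio>

int main() {
    // Assumes at most 64 threads and 64-byte cache lines.
    long counters[64] = {0};

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        for (long i = 0; i < 100000000; ++i)
            counters[tid]++;  // the cache line holding neighbouring
                              // counters ping-pongs between cores
    }

    // A common fix: pad each counter to its own cache line,
    // e.g. struct alignas(64) Padded { long value; };
    std::printf("counters[0] = %ld\n", counters[0]);
    return 0;
}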
CUDA
Programming in Parallel with CUDA - A Practical Guide, by Richard Ansorge - This book is a great introduction to CUDA for anyone accustomed to modern C++. It covers modern CUDA topics such as cooperative groups and even touches on CUDA graphs. Simply put, a must-have!
General performance tuning resources, links and blogs.
Software Optimization Resources by Agner Fog.
easyperf.net - In particular, I'd recommend that you get Denis Bakhvalov's book "Performance Analysis and Tuning on Modern CPUs"; it's free!
gprofng - I'm looking forward to the release of this profiling tool once the GUI is available. From what I've seen, it looks really promising.
How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures
Scalasca and Score-P
General Rules On Presenting Data, e.g. Torsten Hoefler's paper on scientific benchmarking: https://htor.inf.ethz.ch/publications/img/hoefler-scientific-benchmarking.pdf
MPI
CUDA
Documentation and Useful Links
These are links to material/references on CUDA that I find useful:
CUDA types
CUDA contains predefined vector types with 1-4 components of the following primitive types: char, uchar, short, ushort, int, uint, long, ulong, longlong, ulonglong, float, and double.
A CUDA vector type is defined for each of the above types by appending a number between 1 and 4 to the type name. The components of a vector type are named x, y, z, and w; thus a 4-vector vec accesses its four components as vec.x, vec.y, vec.z, and vec.w.
The vector types are constructed with the corresponding make_<type_name> functions, e.g. make_float4(1.f, 2.f, 3.f, 4.f) returns a float4.
Alignment considerations for these types are found in Appendix B.3 of the CUDA C Programming Guide.
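As an illustration, here is a minimal sketch (kernel and variable names are mine) where each thread reads and writes a float4, turning four scalar accesses into one 16-byte load and one 16-byte store:

__global__ void scale4(float a, const float4* x, float4* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = x[i];  // one 16-byte aligned load
        y[i] = make_float4(a * v.x, a * v.y, a * v.z, a * v.w);  // one 16-byte store
    }
}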
Grids, Threads and Warps
CUDA kernels are executed by threads, which are grouped into thread blocks. A thread block can consist of at most 1024 threads on modern NVIDIA GPUs. The reason for this is that all threads in a block must execute on the same streaming multiprocessor (SM).
There are two parameters used when launching a CUDA kernel to configure the threads and thread blocks used by the kernel. They are either of type int or dim3, where

struct dim3 { unsigned int x; unsigned int y; unsigned int z; };

and any dim3 component left unspecified at construction defaults to 1. (The related built-in type uint3, with the same members, is the type of threadIdx and blockIdx.) The first parameter specifies the number of thread blocks (the grid dimensions), and the second specifies the number of threads per block (the block dimensions).
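For example, a hypothetical kernel myKernel covering an nx-by-ny domain could be configured as follows (the rounding up ensures the grid covers the whole domain even when nx and ny are not multiples of the block size):

dim3 block(16, 16);                        // 256 threads per block
dim3 grid((nx + block.x - 1) / block.x,    // number of blocks in x
          (ny + block.y - 1) / block.y);   // number of blocks in y
myKernel<<<grid, block>>>(/* arguments */);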
Synchronous Memory Copies
cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind)
// allocation
data_type *dA, *hA;
hA = (data_type*) malloc(size);
cudaMalloc(&dA, size);
// host to device transfer
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
// device to host transfer
cudaMemcpy(hA, dA, size, cudaMemcpyDeviceToHost);
Pinned Memory
Pinned memory is easiest created with the function cudaHostAlloc
is equivalent to cudaMallocHost
when called with the flag cudaHostAllocDefault
or 0 which is the value of the flag.
cudaError_t cudaHostAlloc(void** hostPointer, size_t size, unsigned int flags);
// example usage
double* hostPointer;
cudaHostAlloc((void**) &hostPointer, N*sizeof(double), cudaHostAllocDefault);
It is preferable to batch small arrays/variables into one larger pinned memory array for data transfers between device and host, as in the sketch below.
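A sketch of such batching (the array names and sizes are mine): two small host arrays are packed into a single pinned staging buffer so that one cudaMemcpy moves both.

double* staging;  // one pinned buffer for both arrays
cudaHostAlloc((void**) &staging, (n1 + n2) * sizeof(double), cudaHostAllocDefault);
memcpy(staging,      a1, n1 * sizeof(double));  // pack the small arrays
memcpy(staging + n1, a2, n2 * sizeof(double));
cudaMemcpy(dBuf, staging, (n1 + n2) * sizeof(double), cudaMemcpyHostToDevice);
cudaFreeHost(staging);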
Streams and Asynchronous Memory Movement
It is possible to execute CUDA kernels and memory transfers asynchronously. In particular, overlapping kernel execution with memory transfers is possible with CUDA streams, as in the sketch below.
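A minimal sketch, assuming hA points to pinned host memory of N doubles, dA is the matching device buffer, and process is a hypothetical kernel: the data is split into two halves, each with its own stream, so the copy of one half can overlap the kernel working on the other.

cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

size_t half = N / 2;
cudaMemcpyAsync(dA,        hA,        half * sizeof(double), cudaMemcpyHostToDevice, s0);
process<<<blocks, threads, 0, s0>>>(dA, half);
cudaMemcpyAsync(dA + half, hA + half, half * sizeof(double), cudaMemcpyHostToDevice, s1);
process<<<blocks, threads, 0, s1>>>(dA + half, half);

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);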
mdspan and multidimensional arrays
C-style
Let \(A\) be an \(m\times n\) matrix (\(m\) rows and \(n\) columns).
There are two major storage layouts for dense matrices in memory:
- row-major (C-style): the elements of each row of \(A\) are stored at contiguous memory locations, so \(A_{ij}\) sits at offset \(i \cdot n + j\);
- column-major (Fortran-style): the elements of each column of \(A\) are stored at contiguous memory locations, so \(A_{ij}\) sits at offset \(j \cdot m + i\).
Assume that \(A\) is a submatrix of a larger \(m_M \times n_M\) matrix \(M\); then the indexing must take this into account. The leading dimension of \(A\) is the stride in memory between consecutive rows (row-major) or columns (column-major), i.e. \(n_M\) if \(M\) is row-major and \(m_M\) if \(M\) is column-major.
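As a sketch (the function and variable names are mine), row-major indexing of \(A\) inside \(M\) then uses the leading dimension \(n_M\) as the row stride:

// A(i, j) for the m-by-n submatrix A of the row-major mM-by-nM matrix M,
// with the top-left element of A at row i0, column j0 of M.
inline double& at(double* M, int nM, int i0, int j0, int i, int j) {
    return M[(size_t)(i0 + i) * nM + (j0 + j)];
}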
mdspan
The multidimensional span, i.e. mdspan, is part of the C++23 standard library; there is also a reference implementation, part of Kokkos, available on GitHub (mdspan). Anyone who has worked on optimizing linear algebra kernels or simply used the BLAS API will understand the value of this.
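A minimal sketch of using it (assuming a C++23 compiler with <mdspan>; with the Kokkos reference implementation the header and namespace differ slightly):

#include <mdspan>
#include <vector>
#include <cstdio>

int main() {
    std::vector<double> buf(3 * 4);
    for (std::size_t k = 0; k < buf.size(); ++k) buf[k] = double(k);

    // Non-owning 3x4 row-major (layout_right) view over the flat buffer.
    std::mdspan A(buf.data(), 3, 4);

    for (std::size_t i = 0; i < A.extent(0); ++i) {
        for (std::size_t j = 0; j < A.extent(1); ++j)
            std::printf("%5.1f ", A[i, j]);  // C++23 multidimensional operator[]
        std::printf("\n");
    }
    return 0;
}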
References
- mdspan proposal
- Multidimensional C++ - CppNorth 2022 talk by Bryce Adelstein Lelbach; the best overview/introduction I have found on how to actually use mdspan.