
Scientific Computing

Have you ever wanted to optimize a piece of code and gone looking for solutions on the internet? I've read too many comments on StackOverflow bashing some poor fellow with answers like:

Premature optimization is the root of all evil

Use a library instead

But someone has to write the code for these libraries, and what if you're just interested in optimizing code for fun? Perhaps, like me, you find it interesting to try to squeeze every last percent of performance out of your hardware. Here is a collection of good resources that I wish I had known about when I was starting out with optimization and performance engineering.

Benchmarking and Performance Engineering

Profilers and Benchmarking libraries

Google Benchmark

gprofng

OpenMP

OpenMP Reference Card

CUDA

Programming in Parallel with CUDA: A Practical Guide by Richard Ansorge
This book is a great introduction to CUDA for anyone accustomed to modern C++. It covers modern CUDA topics such as cooperative groups and even touches on CUDA graphs. Simply put, a must-have!

Software Optimization Resources by Agner Fog.

easyperf.net. In particular, I'd recommend Denis Bakhvalov's book "Performance Analysis and Tuning on Modern CPUs", which is free!

Johnny's Software Lab

gprofng. I'm looking forward to the release of this profiling tool once the GUI is available. From what I've seen, it looks really promising.

How to Benchmark Code Execution Times on Intel IA-32 and IA-64 Instruction Set Architectures

Scalasca and Score-P

scorep [compiler_command]
e.g.

scorep g++ main.cpp -o a.out

General Rules On Presenting Data

https://htor.inf.ethz.ch/publications/img/hoefler-scientific-benchmarking.pdf

MPI

MPICH documentation

CUDA

These are links to material/references on CUDA which I find useful:

CUDA types

CUDA provides predefined vector types with 1 to 4 components of the following primitive types: char, short, uint, int, long, ulong, longlong, ulonglong, float, and double.

A CUDA vector type is obtained for any of the above types by appending a number between 1 and 4 to the type name. The components of the vector types are named x, y, z, and w, so a 4-vector vec accesses its four elements as vec.x, vec.y, vec.z, and vec.w.

The vector types are constructed with the corresponding make_<typename> functions, e.g. make_uint3 for uint3.
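
A minimal sketch of constructing and accessing vector types (the values are just illustrative):

uint3  idx = make_uint3(1u, 2u, 3u);               // 3-component unsigned int vector
float4 v   = make_float4(1.0f, 2.0f, 3.0f, 4.0f);  // 4-component float vector
float  sum = v.x + v.y + v.z + v.w;                // components accessed as .x, .y, .z, .w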

Alignment considerations for these types are found in Appendix B.3 of the CUDA C Programming Guide.

Grids, Threads and Warps

CUDA kernels are executed by threads, which are grouped into thread blocks. On modern NVIDIA GPUs a thread block can consist of at most 1024 threads; the reason is that all threads in a block must execute on the same SM (streaming multiprocessor).

There are two parameters used when launching a CUDA kernel to configure the threads and thread blocks used by the kernel. They are either of type int or dim3, where

struct uint3 { unsigned int x; unsigned int y; unsigned int z; }
struct dim3 { unsigned int x; unsigned int y; unsigned int z; }

The first parameter specifies the number of thread blocks in the grid and the second the number of threads per block:

cudaKernel<<<blocks, threads>>>(...);
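
To make this concrete, here is a minimal sketch with a hypothetical kernel addOne that increments every element of an array of length N (dData is assumed to be a device pointer allocated with cudaMalloc):

__global__ void addOne(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) data[i] += 1.0f;
}

// enough blocks of 256 threads each to cover all N elements
int threads = 256;
int blocks  = (N + threads - 1) / threads;
addOne<<<blocks, threads>>>(dData, N);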

Synchronous Memory Copies

cudaError_t cudaMemcpy (void* dst, const void* src, size_t count, cudaMemcpyKind kind)

// allocation
data_type *dA, *hA;   // device and host pointers (data_type is a placeholder type)
hA = (data_type*) malloc(size);
cudaMalloc(&dA, size);

// host to device transfer
cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);

// device to host transfer
cudaMemcpy(hA, dA, size, cudaMemcpyDeviceToHost);

Pinned Memory

Pinned memory is most easily created with the function cudaHostAlloc, which is equivalent to cudaMallocHost when called with the flag cudaHostAllocDefault (whose value is 0).

cudaError_t cudaHostAlloc(void** hostPointer, size_t size, unsigned int flags);
// example usage
double* hostPointer;
cudaHostAlloc((void**) &hostPointer, N*sizeof(double), cudaHostAllocDefault);

It is preferable to batch small arrays/variables into one larger pinned memory array for data transfers between device and host.
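
A rough sketch of the batching idea, assuming two hypothetical host arrays a and b of N doubles each and a device buffer dBuffer of matching size:

double* pinned;                     // one pinned buffer holding both arrays
cudaHostAlloc((void**) &pinned, 2*N*sizeof(double), cudaHostAllocDefault);
double* a = pinned;                 // first N doubles
double* b = pinned + N;             // next N doubles
// ... fill a and b ...
cudaMemcpy(dBuffer, pinned, 2*N*sizeof(double), cudaMemcpyHostToDevice);  // one transfer instead of two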

Streams and Asynchronous Memory Movement

It is possible to execute CUDA kernels and memory transfers asynchronously. In particular, overlapping kernel execution with memory transfers is possible with CUDA streams.

cudaStream_t cudastream;
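
A minimal sketch of overlapping transfers and kernel work in that stream, reusing the hypothetical addOne kernel and the dA/hA buffers from above (assume data_type is float here; hA must be pinned for the copies to be truly asynchronous):

cudaStreamCreate(&cudastream);
cudaMemcpyAsync(dA, hA, size, cudaMemcpyHostToDevice, cudastream);  // returns immediately to the host
addOne<<<blocks, threads, 0, cudastream>>>(dA, N);                  // runs after the copy, in the same stream
cudaMemcpyAsync(hA, dA, size, cudaMemcpyDeviceToHost, cudastream);
cudaStreamSynchronize(cudastream);  // block the host until all work in the stream has finished
cudaStreamDestroy(cudastream);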

mdspan and multidimensional arrays

C-style

Let \(A\) be a \(m\times n\) matrix (\(m\) rows and \(n\) columns).

\[ A = \begin{pmatrix} a_{0,0} & a_{0,1} & \cdots & a_{0,n-1} \\ a_{1,0} & a_{1,1} & \cdots & a_{1,n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m-1,0} & a_{m-1,1} & \cdots & a_{m-1,n-1} \end{pmatrix} \]

There are two major storage layouts for dense matrices in memory:

row-major (C-style) array indexing of A stores the elements of each row at contiguous memory locations:

    a_ij =  A[n*i + j]

\[ A = (a_{0,0}, a_{0,1}, \dots, a_{0,n-1}, a_{1,0}, a_{1,1}, \dots, a_{1, n-1}, \dots, a_{m-1, n-1}). \]

column-major (Fortran-style) array indexing of A stores the elements of each column at contiguous memory locations:

    a_ij =  A[m*j + i]

\[ A = (a_{0,0}, a_{1,0}, \dots, a_{m-1,0}, a_{0,1}, a_{1,1}, \dots, a_{m-1, 1}, \dots, a_{m-1, n-1}). \]

Assume that \(A\) is a submatrix of a larger \(m_M \times n_M\) matrix \(M\). Then the indexing must take the storage of \(M\) into account: the leading dimension of \(A\) is the stride between consecutive rows (row-major) or columns (column-major) of \(M\), i.e. \(n_M\) if \(M\) is row-major and \(m_M\) if \(M\) is column-major.
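
A small sketch of row-major submatrix indexing with a leading dimension (the names mM, nM, r0, c0, and lda are illustrative):

// M is an mM x nM row-major matrix; A is the m x n submatrix whose
// top-left element sits at row r0, column c0 of M.
int lda = nM;                  // leading dimension of A equals the row length of M
double* A = M + r0*nM + c0;    // pointer to the first element of the submatrix
double a_ij = A[lda*i + j];    // element (i, j) of the submatrix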

mdspan

The multidimensional span, i.e. mdspan, is part of the C++23 standard library; there is also a reference implementation that is part of Kokkos and available on GitHub (mdspan). Anyone who has worked on optimizing linear algebra kernels or simply used the BLAS API will understand the value of this.
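
A small sketch of what this looks like in C++23 (compiler support is still settling; with the Kokkos reference implementation the type lives in namespace std::experimental instead):

#include <cstddef>
#include <mdspan>
#include <vector>

int main() {
    std::vector<double> data(3 * 4);
    // non-owning 3 x 4 row-major view over the flat buffer (layout_right is the default)
    std::mdspan A(data.data(), 3, 4);
    for (std::size_t i = 0; i < A.extent(0); ++i)
        for (std::size_t j = 0; j < A.extent(1); ++j)
            A[i, j] = static_cast<double>(i*A.extent(1) + j);
}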

References