Code Architecture¶
This section describes the internal design and implementation of key µGrid components, including the FFT engine and discrete differential operators.
FFT Engine Design¶
The FFT engine provides distributed Fast Fourier Transform operations on structured grids with MPI parallelization. It uses pencil decomposition for efficient scaling to large numbers of MPI ranks.
Storage Order¶
Array of Structures (AoS) vs Structure of Arrays (SoA)¶
µGrid supports two storage orders for multi-component fields:
AoS (Array of Structures): Components are interleaved per pixel
Memory layout: [c0p0, c1p0, c2p0, c0p1, c1p1, c2p1, ...]
Used on CPU (default for HostSpace)
Stride between consecutive X elements = nb_components
Component offset = comp (component index)
SoA (Structure of Arrays): Components are stored in separate contiguous blocks
Memory layout: [c0p0, c0p1, c0p2, ..., c1p0, c1p1, c1p2, ...]
Used on GPU (default for CUDASpace and ROCmSpace)
Stride between consecutive X elements = 1
Component offset = comp * nb_pixels
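As a concrete illustration of the two layouts, the flat index of component c at linear pixel index p can be written as follows (a minimal Python sketch; the helper names are hypothetical and not part of the µGrid API):
# Illustrative index arithmetic only
def flat_index_aos(p, c, nb_components):
    # AoS: the components of a pixel sit next to each other
    return p * nb_components + c

def flat_index_soa(p, c, nb_pixels):
    # SoA: each component occupies its own contiguous block of nb_pixels entries
    return c * nb_pixels + p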
Why Different Storage Orders?¶
GPU (SoA): Coalesced memory access is critical for GPU performance. When threads access consecutive memory locations, the hardware can combine multiple accesses into a single transaction. SoA ensures that threads processing different pixels of the same component access contiguous memory.
CPU (AoS): Cache locality matters most for multi-component operations. When processing a single pixel, all of its components lie in the same cache line.
Design Decision: FFT work buffers use the same storage order as the memory
space they reside in (SoA on GPU, AoS on CPU). This avoids expensive storage
order conversions during FFT operations. The FFT engine detects the storage
order of input/output fields using field.get_storage_order() and computes
the appropriate strides for each buffer.
Stride Calculations¶
2D Case¶
For a 2D field with dimensions [Nx, Ny] and nb_components components:
SoA strides:
comp_offset_factor = nb_pixels // Component i starts at i * nb_pixels
x_stride = 1 // Consecutive X elements are contiguous
row_dist = row_width // Distance between rows (Y direction)
AoS strides:
comp_offset_factor = 1 // Component i starts at offset i
x_stride = nb_components // Skip over all components between X elements
row_dist = row_width * nb_components // Distance between rows includes all components
3D Case¶
For a 3D field with dimensions [Nx, Ny, Nz]:
SoA strides:
comp_offset_factor = nb_pixels
x_stride = 1
y_dist = row_x
z_dist = row_x * rows_y
AoS strides:
comp_offset_factor = 1
x_stride = nb_components
y_dist = row_x * nb_components
z_dist = row_x * rows_y * nb_components
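The stride tables above can be collapsed into a single helper (an illustrative Python sketch; compute_strides_3d is not a µGrid function):
def compute_strides_3d(nb_components, row_x, rows_y, nb_pixels, soa):
    # Returns (comp_offset_factor, x_stride, y_dist, z_dist) for the chosen layout
    if soa:
        return nb_pixels, 1, row_x, row_x * rows_y
    # AoS: components are interleaved, so every spatial stride picks up nb_components
    return 1, nb_components, row_x * nb_components, row_x * rows_y * nb_components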
GPU Backend Limitations¶
cuFFT Strided R2C/C2R Limitation¶
cuFFT has a documented limitation: strided real-to-complex (R2C) and complex-to-real (C2R) transforms are not supported. The stride on the real data side must be 1.
This affects multi-component FFTs because with AoS storage, the stride between
consecutive real values would be nb_components (not 1).
Solution: With SoA storage order on GPU, each component’s data is contiguous (stride = 1), which satisfies cuFFT’s requirement. The FFT engine loops over components, executing one batched FFT per component.
rocFFT Native API¶
Unlike cuFFT, rocFFT’s native API (rocfft_plan_description_set_data_layout())
supports arbitrary strides for both input and output. However, for consistency
and to avoid storage order conversions, we use the same SoA approach on AMD GPUs.
MPI Parallel FFT Design¶
Pencil Decomposition¶
For 3D distributed FFT, the engine uses pencil decomposition:
Z-pencil: Data distributed in Y and Z, FFT along X (r2c)
Y-pencil: Transpose Y<->Z, FFT along Y (c2c)
X-pencil: Transpose X<->Z, FFT along Z (c2c)
Transpose Operations¶
The Transpose class handles MPI all-to-all communication for redistributing
data between pencil orientations.
Current limitation: Transpose operations assume AoS layout. When using GPU memory, this may require storage order conversion or updating the Transpose class to handle SoA.
Communication Efficiency¶
Design Decision: The FFT module is designed so that MPI communication does
not require explicit packing and unpacking. Data is laid out so that
MPI_Alltoall can operate directly on contiguous memory regions.
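As a rough illustration of the idea (an mpi4py sketch under the assumption of equal-sized, already-contiguous chunks per rank; this is not the µGrid transpose code):
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

chunk = 1024  # values destined for each rank, stored back to back
send = np.arange(nprocs * chunk, dtype=np.float64)
recv = np.empty_like(send)
# Because each destination's chunk is already contiguous, no pack/unpack step is needed
comm.Alltoall(send, recv)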
Forward 2D FFT Algorithm¶
1. r2c FFT along X for each component
- Input: real field (with ghosts)
- Output: work buffer (half-complex, no ghosts)
- Loop over components, batched over Y rows
2. [MPI only] Transpose X<->Y
- Redistributes data across ranks
- Changes from Y-local to X-local
3. c2c FFT along Y for each component
- In-place on work buffer (serial) or output (MPI)
- Batched over X values
4. [Serial only] Copy work to output
- Same storage order, direct copy
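For a single component in the serial case, steps 1 and 3 reduce to the following NumPy sketch (axis order [Y, X] is assumed here purely for illustration):
import numpy as np

field = np.random.rand(8, 6)                   # (Ny, Nx) real input
half = np.fft.rfft(field, axis=-1)             # step 1: r2c along X (half-complex)
spec = np.fft.fft(half, axis=0)                # step 3: c2c along Y
assert np.allclose(spec, np.fft.rfft2(field))  # matches a full 2D r2c transform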
Forward 3D FFT Algorithm¶
1. r2c FFT along X for each component
- Input: real field
- Output: Z-pencil work buffer (half-complex in X)
- Batched over Y rows for each Z plane
2a. [MPI] Transpose Y<->Z
- Z-pencil to Y-pencil
2b. c2c FFT along Y for each component
- On Y-pencil buffer
- Individual transforms per (X, Z) position
2c. [MPI] Transpose Z<->Y
- Y-pencil back to Z-pencil
3. [MPI] Transpose X<->Z (or copy)
- Z-pencil to X-pencil (output)
4. c2c FFT along Z for each component
- On output buffer
- Individual transforms per (X, Y) position
5. [Serial only] Copy work to output
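In serial and for one component, the 3D transform sequence reduces to the NumPy sketch below (axis order [Z, Y, X] is an assumption for illustration):
import numpy as np

field = np.random.rand(4, 5, 6)            # (Nz, Ny, Nx) real input
spec = np.fft.rfft(field, axis=-1)         # step 1: r2c along X
spec = np.fft.fft(spec, axis=1)            # step 2b: c2c along Y
spec = np.fft.fft(spec, axis=0)            # step 4: c2c along Z
assert np.allclose(spec, np.fft.rfftn(field))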
Per-Component Looping¶
Instead of attempting to batch across components (which would require non-unit strides on GPU), the FFT engine loops over components:
for (Index_t comp = 0; comp < nb_components; ++comp) {
Index_t in_comp_offset = comp * in_comp_factor;
Index_t out_comp_offset = comp * work_comp_factor;
backend->r2c(Nx, batch_size,
input_ptr + in_base + in_comp_offset,
in_x_stride, in_row_dist,
work_ptr + out_comp_offset,
work_x_stride, work_row_dist);
}
This approach:
Satisfies cuFFT’s unit-stride requirement for R2C/C2R
Allows efficient batching within each component
Works correctly for both SoA and AoS storage orders
Has minimal overhead (one kernel launch per component)
Normalization¶
Like FFTW and cuFFT, the transforms are unnormalized. A forward FFT followed by an inverse FFT multiplies the result by N (the transform size). Users must explicitly normalize if needed.
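The convention can be reproduced with NumPy by choosing norm flags so that neither direction scales, mimicking FFTW/cuFFT (illustration only):
import numpy as np

x = np.random.rand(16)
X = np.fft.fft(x)                      # unnormalized forward
y = np.fft.ifft(X, norm="forward")     # inverse without the usual 1/N factor
assert np.allclose(y, len(x) * x)      # round trip scales the data by N
x_back = (y / len(x)).real             # explicit user-side normalization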
FFT File Structure¶
src/libmugrid/fft/
├── fft_engine.hh # Template class FFTEngine<MemorySpace>
├── fft_engine_base.hh/cc # Base class with field collections and transpose management
├── fft_1d_backend.hh # Abstract interface for 1D FFT backends
├── pocketfft_backend.cc # CPU backend using pocketfft
├── cufft_backend.cc # NVIDIA GPU backend using cuFFT
├── rocfft_backend.cc # AMD GPU backend using rocFFT
└── transpose.hh/cc # MPI transpose operations
Operator Kernel Design¶
This section describes the design of the discrete differential operators in µGrid, including their common design principles for ghost region handling and periodicity.
Operator Overview¶
µGrid provides several discrete differential operators optimized for structured grids:
| Operator | Description | Stencil Size | Input -> Output |
|---|---|---|---|
| GenericLinearOperator | Flexible stencil-based convolution | Configurable | nodal -> quad or nodal -> nodal |
| FEMGradientOperator | FEM gradient (nodal -> quadrature) | 8 nodes (3D), 4 nodes (2D) | nodal -> quad |
| LaplaceOperator | 7-point (3D) or 5-point (2D) Laplacian | 7 (3D), 5 (2D) | nodal -> nodal |
| IsotropicStiffnessOperator | Fused elasticity kernel | 8 nodes (3D), 4 nodes (2D) | nodal -> nodal |
Common Design Principles¶
All operators follow these design principles for consistency and correctness:
Ghost Region Handling¶
All operators make no assumptions about boundary conditions and operate on an interior region only. Distributed-memory parallelization and periodic boundary conditions are implemented through ghost cell communication:
Ghost regions are filled by CartesianDecomposition::communicate_ghosts() before the operator is called
For periodic BC: ghost regions contain copies from the opposite boundary (see the sketch after this list)
For non-periodic BC with Dirichlet conditions: boundary node values are constrained, and their forces are not computed by the fused kernels (they would be overwritten anyway)
Interior node computation only: For all BCs, operators compute forces only for interior nodes (e.g. indices 1 to n-2, assuming a ghost buffer of width 1), not boundary nodes (indices 0 and n-1)
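For periodic boundaries, the ghost fill is conceptually a wrap-around copy, as in this 1D NumPy sketch with one ghost layer per side (illustration only, not the CartesianDecomposition implementation):
import numpy as np

interior = np.arange(1.0, 6.0)        # 5 interior values
field = np.empty(interior.size + 2)   # one ghost cell on each side
field[1:-1] = interior
field[0] = interior[-1]               # left ghost <- rightmost interior value
field[-1] = interior[0]               # right ghost <- leftmost interior value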
Stencil Offset Convention¶
All operators use a stencil offset to define where the operator “centers”:
# Example: FEMGradientOperator with offset [0, 0, 0]
# Output element (0,0,0) uses input nodes at (0,0,0), (1,0,0), (0,1,0), etc.
grad_op = muGrid.FEMGradientOperator(3, grid_spacing)
# Example: GenericLinearOperator with custom offset
conv_op = muGrid.GenericLinearOperator([0, 0, 0], coefficients)
The offset determines which element’s quadrature points correspond to which nodal region.
Upfront Validation¶
Ghost region requirements are validated once before kernel execution, not during iteration:
Operators check that sufficient ghost cells exist in the field collection
Invalid configurations throw clear error messages
No bounds checking inside the hot loop
Memory Layout¶
CPU operators use AoS, and GPU operators use SoA memory layouts:
CPU: DOF components contiguous: [ux0, uy0, uz0, ux1, uy1, uz1, ...]
GPU: Spatial indices contiguous: [ux0, ux1, ..., uy0, uy1, ..., uz0, uz1, ...]
The kernel implementations may differ between CPU and GPU, since each is optimized for its target architecture.
GenericLinearOperator Design¶
Purpose¶
Flexible stencil-based convolution operator that can represent any linear differential operator as a sparse stencil.
Design¶
Takes a list of coefficient arrays, one per output component per stencil point
Stencil defined by relative offsets from the current position
Supports both forward (apply) and adjoint (transpose) operations
Ghost Requirements¶
The forward and transpose operations have different ghost requirements because the stencil direction is reversed:
Forward (apply): reads at position p + s (stencil offset s)
Left ghosts: max(-offset, 0) in each dimension
Right ghosts: max(stencil_shape - 1 + offset, 0) in each dimension
Transpose: reads at position p - s (reversed stencil direction)
Left ghosts: max(stencil_shape - 1 + offset, 0) in each dimension (swapped from apply’s right)
Right ghosts: max(-offset, 0) in each dimension (swapped from apply’s left)
Example: For a stencil with offset [0, 0, 0] and shape [2, 2, 2]:
Apply: left=0, right=1 (reads ahead)
Transpose: left=1, right=0 (gathers from behind)
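These rules translate directly into a few lines of NumPy (a sketch; ghost_layers is a hypothetical helper, not part of µGrid):
import numpy as np

def ghost_layers(offset, stencil_shape):
    offset = np.asarray(offset)
    shape = np.asarray(stencil_shape)
    left_apply = np.maximum(-offset, 0)
    right_apply = np.maximum(shape - 1 + offset, 0)
    # the transpose operation swaps the two sides
    return left_apply, right_apply, right_apply, left_apply

# offset [0, 0, 0], shape [2, 2, 2]:
# apply left=[0 0 0], apply right=[1 1 1], transpose left=[1 1 1], transpose right=[0 0 0]
print(ghost_layers([0, 0, 0], [2, 2, 2]))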
Usage¶
# Create from FEMGradientOperator coefficients
grad_op = muGrid.FEMGradientOperator(3, grid_spacing)
conv_op = muGrid.GenericLinearOperator([0, 0, 0], grad_op.coefficients)
# Apply gradient
conv_op.apply(nodal_field, quadrature_field)
# Apply divergence (transpose)
conv_op.transpose(quadrature_field, nodal_field, quad_weights)
FEMGradientOperator Design¶
Purpose¶
Computes gradients at quadrature points from nodal displacements using linear finite element (P1) shape functions.
Mathematical Formulation¶
For linear FEM on structured grids:
2D: Each pixel decomposed into 2 triangles, 2 quadrature points
3D: Each voxel decomposed into 5 tetrahedra (Kuhn triangulation), 5 quadrature points
The gradient at quadrature point q is
\(\nabla u_q = B_q \, u_e\)
where \(B_q\) contains the shape function gradients for the element containing q and \(u_e\) collects the nodal values of that element.
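In NumPy terms this is a small matrix product (the array shapes below are assumptions for a 3D element, purely for illustration):
import numpy as np

B_q = np.random.rand(3, 8)   # shape-function gradients for one quadrature point: (dim, nodes per voxel)
u_e = np.random.rand(8, 3)   # nodal displacements of the element: (nodes, dof)
grad_u_q = B_q @ u_e         # displacement gradient at quadrature point q: (dim, dof)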
Shape Function Gradients¶
2D (2 triangles per pixel):
B_2D[triangle][dim][node]
Triangle 0: nodes 0,1,2 (lower-left)
Triangle 1: nodes 1,2,3 (upper-right)
3D (5 tetrahedra per voxel):
B_3D[tet][dim][node]
Tet 0: central tetrahedron (weight 1/3)
Tets 1-4: corner tetrahedra (weight 1/6 each)
Ghost Requirements¶
Forward (apply): Gradient - nodal -> quadrature
Input (nodal field): 1 ghost cell on right in each dimension
Output (quadrature field): no ghosts needed
Iteration: over elements [0, n-2], reads nodal values at element corners [ix, ix+1] x [iy, iy+1] x [iz, iz+1]
Transpose (transpose): Divergence - quadrature -> nodal
Input (quadrature/stress field): 1 ghost cell on left in each dimension (for periodic BC)
Output (nodal field): no specific ghost requirement
Iteration: over elements [0, n-2], accumulates contributions to nodal values at element corners
The ghost requirement swap occurs because:
Forward reads “ahead” (node ix+1 when at element ix) -> needs right ghosts
Transpose gathers from elements “behind” (element ix-1 contributes to node ix) -> needs left ghosts for periodic wraparound
LaplaceOperator Design¶
Purpose¶
Optimized discrete Laplacian for Poisson-type problems.
Mathematical Formulation¶
Standard finite difference Laplacian:
2D (5-point stencil):
   1
1 -4 1
   1
3D (7-point stencil):
Center: -6, Neighbors: +1 (x6)
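For reference, the 2D 5-point stencil with periodic wrap-around can be written in NumPy as follows (a sketch using np.roll; the actual operator reads ghost cells instead of wrapping):
import numpy as np

def laplace_2d_periodic(u, h):
    # 5-point Laplacian, periodic in both directions, scaled by 1/h^2
    return (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0) +
            np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4.0 * u) / h**2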
Ghost Requirements¶
1 ghost cell in each direction (for accessing neighbors)
Properties¶
Self-adjoint: transpose equals forward apply
Negative semi-definite (positive semi-definite with -scale)
Configurable scale factor for grid spacing and sign conventions
Usage¶
# Create Laplacian with scale = -1/h^2 for positive-definite form
laplace = muGrid.LaplaceOperator3D(scale=-1.0 / h**2)
laplace.apply(u, laplacian_u)
IsotropicStiffnessOperator Design¶
Purpose¶
Fused kernel computing force = K @ displacement for isotropic linear elastic
materials, avoiding explicit assembly of the stiffness matrix K.
Mathematical Formulation¶
For isotropic materials, the element stiffness matrix decomposes as
\(K_e = 2\mu \, G + \lambda \, V\)
where:
\(G = \sum_q w_q B_q^T I' B_q\) - geometry matrix (shear stiffness)
\(V = \sum_q w_q (B_q^T m)(m^T B_q)\) - volumetric coupling matrix
\(I' = \text{diag}(1, 1, 1, 0.5, 0.5, 0.5)\) - Voigt scaling for strain energy
\(m = [1, 1, 1, 0, 0, 0]^T\) - trace selector vector
This reduces memory from O(N x DOF^2) for the fully assembled K to O(N x 2) (just the two material parameters, lambda and mu, per element) for spatially-varying materials.
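A minimal sketch of the per-element product the fused kernel evaluates (dense G and V of size 24 x 24 for a 3D element are assumed here; in the real kernel these are fixed coefficient tables):
import numpy as np

ndof = 24                          # 8 nodes x 3 DOFs for a 3D element
G = np.random.rand(ndof, ndof)     # stands in for the geometry (shear) matrix
V = np.random.rand(ndof, ndof)     # stands in for the volumetric coupling matrix
u = np.random.rand(ndof)           # element displacement vector
lam, mu = 1.0, 0.5                 # Lame parameters of this element
f = (2.0 * mu * G + lam * V) @ u   # element force: K_e @ u with K_e = 2*mu*G + lam*V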
Element Decomposition¶
2D: 2 triangles per pixel
Quadrature weights: [0.5, 0.5]
3D: 5 tetrahedra per voxel (Kuhn triangulation)
Quadrature weights: [1/3, 1/6, 1/6, 1/6, 1/6]
Field Requirements¶
All fields use node-based indexing with the same grid dimensions:
| Field | Dimensions | Left Ghosts | Right Ghosts |
|---|---|---|---|
| Displacement | (nx, ny, nz) | 1 | 1 |
| Force | (nx, ny, nz) | 1 | 1 |
| Material (lam, mu) | (nx, ny, nz) | 1 | 1 |
Key design principle: The kernel does not distinguish between periodic and
non-periodic boundary conditions. Ghost cell communication (via
CartesianDecomposition::communicate_ghosts()) handles periodicity:
For periodic BC: ghost cells contain copies from the opposite boundary
For non-periodic BC: ghost cells are filled with appropriate boundary values
This unified approach simplifies the kernel implementation and enables consistent behavior across CPU and GPU.
Gather Pattern¶
The kernel iterates over interior nodes and gathers contributions from all neighboring elements:
2D (4 elements per node):
Element offsets and local node indices:
(-1,-1) -> local node 3 (0,-1) -> local node 2
(-1, 0) -> local node 1 (0, 0) -> local node 0
3D (8 elements per node):
Element offsets and local node indices:
(-1,-1,-1) -> local node 7 (0,-1,-1) -> local node 6
(-1, 0,-1) -> local node 5 (0, 0,-1) -> local node 4
(-1,-1, 0) -> local node 3 (0,-1, 0) -> local node 2
(-1, 0, 0) -> local node 1 (0, 0, 0) -> local node 0
Iteration Bounds¶
The kernel iterates over all interior nodes (indices 0 to n-1):
// Iterate over all nodes - ghost communication handles periodicity
for (ix = 0; ix < nnx; ix++):
for (iy = 0; iy < nny; iy++):
// Compute force at node (ix, iy)
The kernel reads from ghost cells at positions -1 (left) and n (right), which must be populated before calling the operator. For periodic BC, these contain copies from the opposite boundary. For non-periodic BC, these contain boundary values (typically zero for homogeneous Dirichlet conditions).
Kernel Pseudocode¶
for each node (ix, iy, iz) in [0, nnx) x [0, nny) x [0, nnz):
f = [0, 0, 0] // force accumulator
for each neighboring element (8 in 3D):
// Element indices can be -1 (left ghost) or nx (right ghost)
ex, ey, ez = element indices relative to node
local_node = which corner of element is this node
// Material fields are node-indexed, read from element position
// (element at ex,ey,ez has material stored at node ex,ey,ez)
lam = material_lambda[ex, ey, ez]
mu = material_mu[ex, ey, ez]
// Gather displacement from all element nodes (may read ghosts)
u = [displacement at each element corner node]
// Compute stiffness contribution
for each DOF d:
row = local_node * NB_DOFS + d
f[d] += sum_j (2*mu * G[row,j] + lam * V[row,j]) * u[j]
force[ix, iy, iz] = f
Memory Layout Details¶
CPU Layout (Array of Structures)¶
DOF components are contiguous in memory:
Memory: [ux0, uy0, uz0, ux1, uy1, uz1, ux2, uy2, uz2, ...]
Strides:
disp_stride_d = 1; // DOF components contiguous
disp_stride_x = NB_DOFS; // = 3 for 3D
disp_stride_y = NB_DOFS * nx;
disp_stride_z = NB_DOFS * nx * ny;
GPU Layout (Structure of Arrays)¶
Spatial indices are contiguous for coalesced memory access:
Memory: [ux0, ux1, ux2, ..., uy0, uy1, uy2, ..., uz0, uz1, uz2, ...]
Strides:
disp_stride_x = 1; // Spatial x contiguous
disp_stride_y = nx;
disp_stride_z = nx * ny;
disp_stride_d = nx * ny * nz; // DOF components separated
Data Pointer Offset¶
Pointers are offset to account for ghost cells:
const Real* disp_data = displacement.data() + ghost_offset;
// Now index 0 = first interior node, index -1 = left ghost
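Combining the strides with the ghost offset, the flat index of DOF d at node (ix, iy, iz) can be sketched as follows (illustrative Python; the same expression covers both the AoS and SoA stride sets above):
def dof_index(ix, iy, iz, d, strides, ghost_offset):
    # strides = (stride_x, stride_y, stride_z, stride_d), taken from the layout tables above
    sx, sy, sz, sd = strides
    return ghost_offset + ix * sx + iy * sy + iz * sz + d * sd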
Operator File Structure¶
src/libmugrid/operators/
├── linear.hh # Base LinearOperator class
├── generic.hh/.cc # GenericLinearOperator
├── laplace_2d.hh/.cc # LaplaceOperator2D
├── laplace_3d.hh/.cc # LaplaceOperator3D
├── fem_gradient_2d.hh/.cc # FEMGradientOperator2D
├── fem_gradient_3d.hh/.cc # FEMGradientOperator3D
└── solids/
├── isotropic_stiffness_2d.hh/.cc # IsotropicStiffnessOperator2D
├── isotropic_stiffness_3d.hh/.cc # IsotropicStiffnessOperator3D
└── isotropic_stiffness_gpu.cc # GPU kernels
Testing¶
Tests are in tests/python_isotropic_stiffness_operator_tests.py:
test_compare_with_generic_operator: Compares the fused kernel against an explicit B^T C B product
test_unit_impulse_*: Verifies the response to a unit displacement at specific nodes
test_symmetry: Verifies that K is symmetric
ValidationGuardTest*: Verifies proper error handling for invalid configurations
GPUUnitImpulseTest: Verifies that the GPU kernel matches the CPU kernel output
With node-based indexing, all tests compare the full output (all nodes), not just interior nodes.