Code Architecture

This section describes the internal design and implementation of key µGrid components, including the FFT engine and discrete differential operators.

FFT Engine Design

The FFT engine provides distributed Fast Fourier Transform operations on structured grids with MPI parallelization. It uses pencil decomposition for efficient scaling to large numbers of MPI ranks.

Storage Order

Array of Structures (AoS) vs Structure of Arrays (SoA)

µGrid supports two storage orders for multi-component fields:

AoS (Array of Structures): Components are interleaved per pixel

  • Memory layout: [c0p0, c1p0, c2p0, c0p1, c1p1, c2p1, ...]

  • Used on CPU (default for HostSpace)

  • Stride between consecutive X elements = nb_components

  • Component offset = comp (component index)

SoA (Structure of Arrays): Components are stored in separate contiguous blocks

  • Memory layout: [c0p0, c0p1, c0p2, ..., c1p0, c1p1, c1p2, ...]

  • Used on GPU (default for CUDASpace and ROCmSpace)

  • Stride between consecutive X elements = 1

  • Component offset = comp * nb_pixels
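
The addressing rules above can be made concrete with a short sketch. The snippet below is illustrative only (plain Python, not the µGrid implementation) and reproduces the two memory layouts from the given index formulas.

nb_components = 3
nb_pixels = 4

def aos_index(pixel, comp):
    # AoS: components interleaved per pixel; pixel stride = nb_components
    return pixel * nb_components + comp

def soa_index(pixel, comp):
    # SoA: each component occupies a contiguous block of nb_pixels entries
    return comp * nb_pixels + pixel

# Which (component, pixel) value sits at each memory slot:
aos = [None] * (nb_pixels * nb_components)
soa = [None] * (nb_pixels * nb_components)
for p in range(nb_pixels):
    for c in range(nb_components):
        aos[aos_index(p, c)] = f"c{c}p{p}"
        soa[soa_index(p, c)] = f"c{c}p{p}"
print(aos)  # ['c0p0', 'c1p0', 'c2p0', 'c0p1', ...]
print(soa)  # ['c0p0', 'c0p1', 'c0p2', 'c0p3', 'c1p0', ...]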

Why Different Storage Orders?

GPU (SoA): Coalesced memory access is critical for GPU performance. When threads access consecutive memory locations, the hardware can combine multiple accesses into a single transaction. SoA ensures that threads processing different pixels of the same component access contiguous memory.

CPU (AoS): Cache locality matters for multi-component operations. When a single pixel is processed, all of its components lie in the same cache line.

Design Decision: FFT work buffers use the same storage order as the memory space they reside in (SoA on GPU, AoS on CPU). This avoids expensive storage order conversions during FFT operations. The FFT engine detects the storage order of input/output fields using field.get_storage_order() and computes the appropriate strides for each buffer.

Stride Calculations

2D Case

For a 2D field with dimensions [Nx, Ny] and nb_components components:

SoA strides:

comp_offset_factor = nb_pixels        // Component i starts at i * nb_pixels
x_stride = 1                          // Consecutive X elements are contiguous
row_dist = row_width                  // Distance between rows (Y direction)

AoS strides:

comp_offset_factor = 1                // Component i starts at offset i
x_stride = nb_components              // Skip over all components between X elements
row_dist = row_width * nb_components  // Distance between rows includes all components

3D Case

For a 3D field with dimensions [Nx, Ny, Nz]:

SoA strides:

comp_offset_factor = nb_pixels
x_stride = 1
y_dist = row_x
z_dist = row_x * rows_y

AoS strides:

comp_offset_factor = 1
x_stride = nb_components
y_dist = row_x * nb_components
z_dist = row_x * rows_y * nb_components
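
As a sketch of how these quantities combine into a flat address (illustrative Python, not the µGrid API; row_x and rows_y stand for the buffer's row length and row count, which for a half-complex work buffer is Nx/2 + 1 rather than Nx):

def strides_3d(row_x, rows_y, rows_z, nb_components, storage_order):
    nb_pixels = row_x * rows_y * rows_z
    if storage_order == "SoA":
        return dict(comp_offset_factor=nb_pixels, x_stride=1,
                    y_dist=row_x, z_dist=row_x * rows_y)
    else:  # AoS
        return dict(comp_offset_factor=1, x_stride=nb_components,
                    y_dist=row_x * nb_components,
                    z_dist=row_x * rows_y * nb_components)

# Flat index of component c at grid point (ix, iy, iz):
#   c * comp_offset_factor + ix * x_stride + iy * y_dist + iz * z_dist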

GPU Backend Limitations

cuFFT Strided R2C/C2R Limitation

cuFFT has a documented limitation: strided real-to-complex (R2C) and complex-to-real (C2R) transforms are not supported. The stride on the real data side must be 1.

This affects multi-component FFTs because with AoS storage, the stride between consecutive real values would be nb_components (not 1).

Solution: With SoA storage order on GPU, each component’s data is contiguous (stride = 1), which satisfies cuFFT’s requirement. The FFT engine loops over components, executing one batched FFT per component.

rocFFT Native API

Unlike cuFFT, rocFFT’s native API (rocfft_plan_description_set_data_layout()) supports arbitrary strides for both input and output. However, for consistency and to avoid storage order conversions, we use the same SoA approach on AMD GPUs.

MPI Parallel FFT Design

Pencil Decomposition

For 3D distributed FFT, the engine uses pencil decomposition:

  1. Z-pencil: Data distributed in Y and Z, FFT along X (r2c)

  2. Y-pencil: Transpose Y<->Z, FFT along Y (c2c)

  3. X-pencil: Transpose X<->Z, FFT along Z (c2c)

Transpose Operations

The Transpose class handles MPI all-to-all communication for redistributing data between pencil orientations.

Current limitation: Transpose operations assume AoS layout. When using GPU memory, this may require storage order conversion or updating the Transpose class to handle SoA.

Communication Efficiency

Design Decision: The FFT module is designed so that MPI communication does not require explicit packing and unpacking. Data is laid out so that MPI_Alltoall can operate directly on contiguous memory regions.

Forward 2D FFT Algorithm

1. r2c FFT along X for each component
   - Input: real field (with ghosts)
   - Output: work buffer (half-complex, no ghosts)
   - Loop over components, batched over Y rows

2. [MPI only] Transpose X<->Y
   - Redistributes data across ranks
   - Changes from Y-local to X-local

3. c2c FFT along Y for each component
   - In-place on work buffer (serial) or output (MPI)
   - Batched over X values

4. [Serial only] Copy work to output
   - Same storage order, direct copy
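
For the serial case, the net effect of steps 1 and 3 can be checked against NumPy. The sketch below is a conceptual reference only; ghost layers, work-buffer strides and the MPI transpose are omitted.

import numpy as np

nb_components, Ny, Nx = 3, 8, 16
u = np.random.default_rng(0).random((nb_components, Ny, Nx))

work = np.fft.rfft(u, axis=-1)   # step 1: r2c along X, per component
out = np.fft.fft(work, axis=-2)  # step 3: c2c along Y, per component

assert np.allclose(out, np.fft.rfftn(u, axes=(-2, -1)))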

Forward 3D FFT Algorithm

1. r2c FFT along X for each component
   - Input: real field
   - Output: Z-pencil work buffer (half-complex in X)
   - Batched over Y rows for each Z plane

2a. [MPI] Transpose Y<->Z
    - Z-pencil to Y-pencil

2b. c2c FFT along Y for each component
    - On Y-pencil buffer
    - Individual transforms per (X, Z) position

2c. [MPI] Transpose Z<->Y
    - Y-pencil back to Z-pencil

3. [MPI] Transpose X<->Z (or copy)
   - Z-pencil to X-pencil (output)

4. c2c FFT along Z for each component
   - On output buffer
   - Individual transforms per (X, Y) position

5. [Serial only] Copy work to output

Per-Component Looping

Instead of attempting to batch across components (which would require non-unit strides on GPU), the FFT engine loops over components:

// One batched 1D transform per component; the component offsets select the
// component's contiguous block (SoA) or its interleaved slot (AoS).
for (Index_t comp = 0; comp < nb_components; ++comp) {
    Index_t in_comp_offset = comp * in_comp_factor;
    Index_t out_comp_offset = comp * work_comp_factor;
    backend->r2c(Nx, batch_size,
                 input_ptr + in_base + in_comp_offset,
                 in_x_stride, in_row_dist,
                 work_ptr + out_comp_offset,
                 work_x_stride, work_row_dist);
}

This approach:

  • Satisfies cuFFT’s unit-stride requirement for R2C/C2R

  • Allows efficient batching within each component

  • Works correctly for both SoA and AoS storage orders

  • Has minimal overhead (one kernel launch per component)

Normalization

Like FFTW and cuFFT, the transforms are unnormalized: a forward FFT followed by an inverse FFT scales the data by the product of the transform sizes along all transformed axes (e.g. Nx * Ny * Nz in 3D). Users must normalize explicitly if needed.
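
The convention can be emulated with NumPy (the norm argument requires NumPy >= 1.20; this only illustrates the scaling, it is not the µGrid API):

import numpy as np

N = 8
u = np.random.default_rng(1).random(N)

U = np.fft.fft(u)                        # unnormalized forward (FFTW convention)
u_back = np.fft.ifft(U, norm="forward")  # unnormalized inverse

assert np.allclose(u_back, N * u)        # round trip scales by N
u_normalized = u_back / N                # explicit user-side normalization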

FFT File Structure

src/libmugrid/fft/
├── fft_engine.hh         # Template class FFTEngine<MemorySpace>
├── fft_engine_base.hh/cc # Base class with field collections and transpose management
├── fft_1d_backend.hh     # Abstract interface for 1D FFT backends
├── pocketfft_backend.cc  # CPU backend using pocketfft
├── cufft_backend.cc      # NVIDIA GPU backend using cuFFT
├── rocfft_backend.cc     # AMD GPU backend using rocFFT
└── transpose.hh/cc       # MPI transpose operations

Operator Kernel Design

This section describes the design of the discrete differential operators in µGrid, including their common design principles for ghost region handling and periodicity.

Operator Overview

µGrid provides several discrete differential operators optimized for structured grids:

Operator                     Description                               Stencil Size                Input -> Output
GenericLinearOperator        Flexible stencil-based convolution        Configurable                nodal -> quad or nodal -> nodal
FEMGradientOperator          FEM gradient (nodal -> quadrature)        8 nodes (3D), 4 nodes (2D)  nodal -> quad
LaplaceOperator              7-point (3D) or 5-point (2D) Laplacian    7 (3D), 5 (2D)              nodal -> nodal
IsotropicStiffnessOperator   Fused elasticity kernel                   8 nodes (3D), 4 nodes (2D)  nodal -> nodal

Common Design Principles

All operators follow these design principles for consistency and correctness:

Ghost Region Handling

All operators make no assumptions about boundary conditions and operate on the interior region only. Distributed-memory parallelization and periodic boundary conditions are implemented through ghost cell communication:

  • Ghost regions are filled by CartesianDecomposition::communicate_ghosts() before the operator is called

  • For periodic BC: ghost regions contain copies from the opposite boundary

  • For non-periodic BC with Dirichlet conditions: boundary node values are constrained, and their forces are not computed by the fused kernels (they would be overwritten anyway)

  • Interior node computation only: for all BCs, operators compute forces only for interior nodes (e.g. indices 1 to n-2, assuming a ghost buffer of width 1 is included in the index range), not for boundary nodes (indices 0 and n-1); see the sketch after this list
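
For the periodic case, a ghost layer of width 1 is simply a wrapped copy of the opposite boundary. The NumPy sketch below only illustrates the idea; the actual exchange is performed by CartesianDecomposition::communicate_ghosts().

import numpy as np

interior = np.arange(1, 6)                 # interior node values 1..5
padded = np.pad(interior, 1, mode="wrap")  # [5, 1, 2, 3, 4, 5, 1]

# Operators write results only for the interior slice padded[1:-1];
# the ghost entries padded[0] and padded[-1] are read-only inputs.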

Stencil Offset Convention

All operators use a stencil offset to define where the operator “centers”:

# Example: FEMGradientOperator with offset [0, 0, 0]
# Output element (0,0,0) uses input nodes at (0,0,0), (1,0,0), (0,1,0), etc.
grad_op = muGrid.FEMGradientOperator(3, grid_spacing)

# Example: GenericLinearOperator with custom offset
conv_op = muGrid.GenericLinearOperator([0, 0, 0], coefficients)

The offset determines which element’s quadrature points correspond to which nodal region.

Upfront Validation

Ghost region requirements are validated once before kernel execution, not during iteration:

  • Operators check that sufficient ghost cells exist in the field collection

  • Invalid configurations throw clear error messages

  • No bounds checking inside the hot loop

Memory Layout

CPU operators use AoS, and GPU operators use SoA memory layouts:

  • CPU: DOF components contiguous [ux0, uy0, uz0, ux1, uy1, uz1, ...]

  • GPU: Spatial indices contiguous [ux0, ux1, ..., uy0, uy1, ..., uz0, uz1, ...]

The kernel implementations can differ between CPU and GPU, since each is optimized for its target architecture.

GenericLinearOperator Design

Purpose

Flexible stencil-based convolution operator that can represent any linear differential operator as a sparse stencil.

Design

  • Takes a list of coefficient arrays, one per output component per stencil point

  • Stencil defined by relative offsets from the current position

  • Supports both forward (apply) and adjoint (transpose) operations

Ghost Requirements

The forward and transpose operations have different ghost requirements because the stencil direction is reversed:

Forward (apply): reads at position p + s (stencil offset s)

  • Left ghosts: max(-offset, 0) in each dimension

  • Right ghosts: max(stencil_shape - 1 + offset, 0) in each dimension

Transpose: reads at position p - s (reversed stencil direction)

  • Left ghosts: max(stencil_shape - 1 + offset, 0) in each dimension (swapped from apply’s right)

  • Right ghosts: max(-offset, 0) in each dimension (swapped from apply’s left)

Example: For a stencil with offset [0, 0, 0] and shape [2, 2, 2]:

  • Apply: left=0, right=1 (reads ahead)

  • Transpose: left=1, right=0 (gathers from behind)
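
A small sketch of these formulas (per dimension, element-wise; names are illustrative, not the µGrid API):

import numpy as np

def ghost_widths(offset, stencil_shape):
    offset = np.asarray(offset)
    shape = np.asarray(stencil_shape)
    apply_left = np.maximum(-offset, 0)
    apply_right = np.maximum(shape - 1 + offset, 0)
    # transpose swaps left and right because the stencil direction is reversed
    return (apply_left, apply_right), (apply_right, apply_left)

apply_lr, transpose_lr = ghost_widths([0, 0, 0], [2, 2, 2])
# apply:     left = [0 0 0], right = [1 1 1]
# transpose: left = [1 1 1], right = [0 0 0]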

Usage

# Create from FEMGradientOperator coefficients
grad_op = muGrid.FEMGradientOperator(3, grid_spacing)
conv_op = muGrid.GenericLinearOperator([0, 0, 0], grad_op.coefficients)

# Apply gradient
conv_op.apply(nodal_field, quadrature_field)

# Apply divergence (transpose)
conv_op.transpose(quadrature_field, nodal_field, quad_weights)

FEMGradientOperator Design

Purpose

Computes gradients at quadrature points from nodal displacements using linear finite element (P1) shape functions.

Mathematical Formulation

For linear FEM on structured grids:

  • 2D: Each pixel decomposed into 2 triangles, 2 quadrature points

  • 3D: Each voxel decomposed into 5 tetrahedra (Kuhn triangulation), 5 quadrature points

The gradient at quadrature point q is:

\[\left.\frac{\partial u}{\partial x_i}\right|_q = \frac{1}{h_i} \sum_I B_q[i, I]\, u_I\]

where \(B_q\) contains shape function gradients for the element containing q.

Shape Function Gradients

2D (2 triangles per pixel):

B_2D[triangle][dim][node]
Triangle 0: nodes 0,1,2 (lower-left)
Triangle 1: nodes 1,2,3 (upper-right)

3D (5 tetrahedra per voxel):

B_3D[tet][dim][node]
Tet 0: central tetrahedron (weight 1/3)
Tets 1-4: corner tetrahedra (weight 1/6 each)
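
As an illustration of how such a B table is used, the sketch below evaluates the gradient formula on a single reference right triangle with nodes at (0, 0), (h, 0), (0, h). The B values shown are the standard P1 gradients for that reference element, not necessarily the entries of µGrid's B_2D table.

import numpy as np

B = np.array([[-1.0, 1.0, 0.0],   # d/dx coefficients, one column per node
              [-1.0, 0.0, 1.0]])  # d/dy coefficients, one column per node
h = 0.5
u_nodes = np.array([2.0, 3.0, 5.0])  # samples of u(x, y) = 2 + 2x + 6y

grad_u = B @ u_nodes / h             # [du/dx, du/dy] at the quadrature point
assert np.allclose(grad_u, [2.0, 6.0])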

Ghost Requirements

Forward (apply): Gradient - nodal -> quadrature

  • Input (nodal field): 1 ghost cell on right in each dimension

  • Output (quadrature field): no ghosts needed

  • Iteration: over elements [0, n-2], reads nodal values at element corners [ix, ix+1] x [iy, iy+1] x [iz, iz+1]

Transpose (transpose): Divergence - quadrature -> nodal

  • Input (quadrature/stress field): 1 ghost cell on left in each dimension (for periodic BC)

  • Output (nodal field): no specific ghost requirement

  • Iteration: over elements [0, n-2], accumulates contributions to nodal values at element corners

The ghost requirement swap occurs because:

  • Forward reads “ahead” (node ix+1 when at element ix) -> needs right ghosts

  • Transpose gathers from elements “behind” (element ix-1 contributes to node ix) -> needs left ghosts for periodic wraparound

LaplaceOperator Design

Purpose

Optimized discrete Laplacian for Poisson-type problems.

Mathematical Formulation

Standard finite difference Laplacian:

\[\nabla^2 u \approx \frac{u[i+1] + u[i-1] - 2u[i]}{h^2} + \ldots\]

2D (5-point stencil):

  1
1 -4 1
  1

3D (7-point stencil):

Center: -6, Neighbors: +1 (x6)
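
A conceptual NumPy version of the 2D stencil on a periodic grid (np.roll stands in for the ghost-cell exchange; this is not the µGrid kernel):

import numpy as np

def laplacian_2d(u, h):
    # 5-point stencil: sum of the four neighbors minus 4x the center, over h^2
    return (np.roll(u, 1, axis=0) + np.roll(u, -1, axis=0) +
            np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) - 4.0 * u) / h**2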

Ghost Requirements

  • 1 ghost cell in each direction (for accessing neighbors)

Properties

  • Self-adjoint: transpose equals forward apply

  • Negative semi-definite (positive semi-definite with -scale)

  • Configurable scale factor for grid spacing and sign conventions

Usage

# Create Laplacian with scale = -1/h^2 for positive-definite form
laplace = muGrid.LaplaceOperator3D(scale=-1.0 / h**2)
laplace.apply(u, laplacian_u)

IsotropicStiffnessOperator Design

Purpose

Fused kernel computing force = K @ displacement for isotropic linear elastic materials, avoiding explicit assembly of the stiffness matrix K.

Mathematical Formulation

For isotropic materials, the element stiffness matrix decomposes as:

\[K_e = 2\mu G + \lambda V\]

where:

  • \(G = \sum_q w_q B_q^T I' B_q\) - geometry matrix (shear stiffness)

  • \(V = \sum_q w_q (B_q^T m)(m^T B_q)\) - volumetric coupling matrix

  • \(I' = \text{diag}(1, 1, 1, 0.5, 0.5, 0.5)\) - Voigt scaling for strain energy

  • \(m = [1, 1, 1, 0, 0, 0]^T\) - trace selector vector

For spatially varying materials this reduces storage from O(N x DOF^2) for an explicitly assembled K to O(N), since only two material constants (lambda and mu) are stored per pixel.
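
The structure of the decomposition can be sketched with placeholder matrices. Below, the B_q are random stand-ins (the real values come from the P1 shape functions); only the shapes and the assembly of G, V and K_e are meant to be illustrated.

import numpy as np

rng = np.random.default_rng(0)
weights = [1/3, 1/6, 1/6, 1/6, 1/6]            # 5 tetrahedra per voxel (3D)
B = [rng.random((6, 24)) for _ in weights]     # placeholder (6 Voigt x 24 DOFs)

I_prime = np.diag([1, 1, 1, 0.5, 0.5, 0.5])    # Voigt scaling
m = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # trace selector

G = sum(w * Bq.T @ I_prime @ Bq for w, Bq in zip(weights, B))
V = sum(w * np.outer(Bq.T @ m, m @ Bq) for w, Bq in zip(weights, B))

lam, mu = 1.0, 0.5
K_e = 2 * mu * G + lam * V                     # 24 x 24 element stiffness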

Element Decomposition

2D: 2 triangles per pixel

  • Quadrature weights: [0.5, 0.5]

3D: 5 tetrahedra per voxel (Kuhn triangulation)

  • Quadrature weights: [1/3, 1/6, 1/6, 1/6, 1/6]

Field Requirements

All fields use node-based indexing with the same grid dimensions:

Field               Dimensions    Left Ghosts  Right Ghosts
Displacement        (nx, ny, nz)  1            1
Force               (nx, ny, nz)  1            1
Material (lam, mu)  (nx, ny, nz)  1            1

Key design principle: The kernel does not distinguish between periodic and non-periodic boundary conditions. Ghost cell communication (via CartesianDecomposition::communicate_ghosts()) handles periodicity:

  • For periodic BC: ghost cells contain copies from the opposite boundary

  • For non-periodic BC: ghost cells are filled with appropriate boundary values

This unified approach simplifies the kernel implementation and enables consistent behavior across CPU and GPU.

Gather Pattern

The kernel iterates over interior nodes and gathers contributions from all neighboring elements:

2D (4 elements per node):

Element offsets and local node indices:
  (-1,-1) -> local node 3    (0,-1) -> local node 2
  (-1, 0) -> local node 1    (0, 0) -> local node 0

3D (8 elements per node):

Element offsets and local node indices:
  (-1,-1,-1) -> local node 7    (0,-1,-1) -> local node 6
  (-1, 0,-1) -> local node 5    (0, 0,-1) -> local node 4
  (-1,-1, 0) -> local node 3    (0,-1, 0) -> local node 2
  (-1, 0, 0) -> local node 1    (0, 0, 0) -> local node 0
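
One compact way to encode this mapping (illustrative; the actual kernel may simply hard-code the table): with element offsets ox, oy, oz each in {-1, 0}, the local node index of the current node within that element is (-ox) + 2*(-oy) + 4*(-oz).

def local_node_index(ox, oy, oz):
    # ox, oy, oz: element offsets relative to the node, each -1 or 0
    return (-ox) + 2 * (-oy) + 4 * (-oz)

assert local_node_index(-1, -1, -1) == 7
assert local_node_index(0, 0, -1) == 4
assert local_node_index(0, 0, 0) == 0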

Iteration Bounds

The kernel iterates over all interior (non-ghost) nodes, indices 0 to n-1:

// Iterate over all interior nodes - ghost communication handles periodicity
for (Index_t ix = 0; ix < nnx; ++ix) {
    for (Index_t iy = 0; iy < nny; ++iy) {
        // Compute force at node (ix, iy)
    }
}

The kernel reads from ghost cells at positions -1 (left) and n (right), which must be populated before calling the operator. For periodic BC, these contain copies from the opposite boundary. For non-periodic BC, these contain boundary values (typically zero for homogeneous Dirichlet conditions).

Kernel Pseudocode

for each node (ix, iy, iz) in [0, nnx) x [0, nny) x [0, nnz):
    f = [0, 0, 0]  // force accumulator

    for each neighboring element (8 in 3D):
        // Element indices can be -1 (left ghost) or nx (right ghost)
        ex, ey, ez = element indices relative to node
        local_node = which corner of element is this node

        // Material fields are node-indexed, read from element position
        // (element at ex,ey,ez has material stored at node ex,ey,ez)
        lam = material_lambda[ex, ey, ez]
        mu = material_mu[ex, ey, ez]

        // Gather displacement from all element nodes (may read ghosts)
        u = [displacement at each element corner node]

        // Compute stiffness contribution
        for each DOF d:
            row = local_node * NB_DOFS + d
            f[d] += sum_j (2*mu * G[row,j] + lam * V[row,j]) * u[j]

    force[ix, iy, iz] = f

Memory Layout Details

CPU Layout (Array of Structures)

DOF components are contiguous in memory:

Memory: [ux0, uy0, uz0, ux1, uy1, uz1, ux2, uy2, uz2, ...]

Strides:

disp_stride_d = 1;                    // DOF components contiguous
disp_stride_x = NB_DOFS;              // = 3 for 3D
disp_stride_y = NB_DOFS * nx;
disp_stride_z = NB_DOFS * nx * ny;

GPU Layout (Structure of Arrays)

Spatial indices are contiguous for coalesced memory access:

Memory: [ux0, ux1, ux2, ..., uy0, uy1, uy2, ..., uz0, uz1, uz2, ...]

Strides:

disp_stride_x = 1;                    // Spatial x contiguous
disp_stride_y = nx;
disp_stride_z = nx * ny;
disp_stride_d = nx * ny * nz;         // DOF components separated

Data Pointer Offset

Pointers are offset to account for ghost cells:

const Real* disp_data = displacement.data() + ghost_offset;
// Now index 0 = first interior node, index -1 = left ghost

Operator File Structure

src/libmugrid/operators/
├── linear.hh              # Base LinearOperator class
├── generic.hh/.cc         # GenericLinearOperator
├── laplace_2d.hh/.cc      # LaplaceOperator2D
├── laplace_3d.hh/.cc      # LaplaceOperator3D
├── fem_gradient_2d.hh/.cc # FEMGradientOperator2D
├── fem_gradient_3d.hh/.cc # FEMGradientOperator3D
└── solids/
    ├── isotropic_stiffness_2d.hh/.cc  # IsotropicStiffnessOperator2D
    ├── isotropic_stiffness_3d.hh/.cc  # IsotropicStiffnessOperator3D
    └── isotropic_stiffness_gpu.cc     # GPU kernels

Testing

Tests are in tests/python_isotropic_stiffness_operator_tests.py:

  • test_compare_with_generic_operator: Compares fused kernel against explicit B^T C B

  • test_unit_impulse_*: Verifies response to unit displacement at specific nodes

  • test_symmetry: Verifies K is symmetric

  • ValidationGuardTest*: Verifies proper error handling for invalid configurations

  • GPUUnitImpulseTest: Verifies GPU kernel matches CPU kernel output

With node-based indexing, all tests compare the full output (all nodes), not just interior nodes.