Code Architecture
#################

This section describes the internal design and implementation of key *µ*\Grid
components, including the FFT engine and the discrete differential operators.

.. contents:: Table of Contents
   :local:
   :depth: 2

FFT Engine Design
*****************

The FFT engine provides distributed Fast Fourier Transform operations on
structured grids with MPI parallelization. It uses pencil decomposition to
scale efficiently to large numbers of MPI ranks.

Storage Order
=============

Array of Structures (AoS) vs Structure of Arrays (SoA)
------------------------------------------------------

*µ*\Grid supports two storage orders for multi-component fields:

**AoS (Array of Structures)**: Components are interleaved per pixel

- Memory layout: ``[c0p0, c1p0, c2p0, c0p1, c1p1, c2p1, ...]``
- Used on **CPU** (default for ``HostSpace``)
- Stride between consecutive X elements = ``nb_components``
- Component offset = ``comp`` (component index)

**SoA (Structure of Arrays)**: Components are stored in separate contiguous blocks

- Memory layout: ``[c0p0, c0p1, c0p2, ..., c1p0, c1p1, c1p2, ...]``
- Used on **GPU** (default for ``CUDASpace`` and ``ROCmSpace``)
- Stride between consecutive X elements = 1
- Component offset = ``comp * nb_pixels``

Why Different Storage Orders?
-----------------------------

**GPU (SoA)**: Coalesced memory access is critical for GPU performance. When
threads access consecutive memory locations, the hardware can combine multiple
accesses into a single transaction. SoA ensures that threads processing
different pixels of the same component access contiguous memory.

**CPU (AoS)**: Cache locality for multi-component operations. When processing a
single pixel, all of its components lie in the same cache line.

**Design Decision**: FFT work buffers use the same storage order as the memory
space they reside in (SoA on GPU, AoS on CPU). This avoids expensive storage
order conversions during FFT operations.

The FFT engine detects the storage order of input and output fields using
``field.get_storage_order()`` and computes the appropriate strides for each
buffer.

Stride Calculations
===================

2D Case
-------

For a 2D field with dimensions ``[Nx, Ny]`` and ``nb_components`` components:

**SoA strides:**

.. code-block:: text

   comp_offset_factor = nb_pixels   // Component i starts at i * nb_pixels
   x_stride = 1                     // Consecutive X elements are contiguous
   row_dist = row_width             // Distance between rows (Y direction)

**AoS strides:**

.. code-block:: text

   comp_offset_factor = 1                 // Component i starts at offset i
   x_stride = nb_components               // Skip over all components between X elements
   row_dist = row_width * nb_components   // Distance between rows includes all components

3D Case
-------

For a 3D field with dimensions ``[Nx, Ny, Nz]``:

**SoA strides:**

.. code-block:: text

   comp_offset_factor = nb_pixels
   x_stride = 1
   y_dist = row_x
   z_dist = row_x * rows_y

**AoS strides:**

.. code-block:: text

   comp_offset_factor = 1
   x_stride = nb_components
   y_dist = row_x * nb_components
   z_dist = row_x * rows_y * nb_components
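The snippet below shows how these stride quantities combine into a flat buffer
index for the 2D case. The helper function is purely illustrative and is not
part of the *µ*\Grid API.

.. code-block:: python

   # Illustrative only: combine component offset, X stride and row distance
   # into a flat buffer index for a 2D multi-component field.
   def flat_index_2d(comp, ix, iy, nb_components, row_width, nb_pixels, soa):
       if soa:  # SoA, GPU default
           comp_offset_factor, x_stride, row_dist = nb_pixels, 1, row_width
       else:    # AoS, CPU default
           comp_offset_factor = 1
           x_stride = nb_components
           row_dist = row_width * nb_components
       return comp * comp_offset_factor + ix * x_stride + iy * row_dist

   # 4 x 3 grid with 2 components: the same (comp, ix, iy) lands at different
   # flat offsets depending on the storage order.
   assert flat_index_2d(1, 2, 0, 2, row_width=4, nb_pixels=12, soa=True) == 14
   assert flat_index_2d(1, 2, 0, 2, row_width=4, nb_pixels=12, soa=False) == 5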
GPU Backend Limitations
=======================

cuFFT Strided R2C/C2R Limitation
--------------------------------

cuFFT has a documented limitation: **strided real-to-complex (R2C) and
complex-to-real (C2R) transforms are not supported**. The stride on the real
data side must be 1. This affects multi-component FFTs because, with AoS
storage, the stride between consecutive real values would be ``nb_components``
(not 1).

**Solution**: With SoA storage order on GPU, each component's data is
contiguous (stride = 1), which satisfies cuFFT's requirement. The FFT engine
loops over components, executing one batched FFT per component.

rocFFT Native API
-----------------

Unlike cuFFT, rocFFT's native API
(``rocfft_plan_description_set_data_layout()``) supports arbitrary strides for
both input and output. However, for consistency and to avoid storage order
conversions, we use the same SoA approach on AMD GPUs.

MPI Parallel FFT Design
=======================

Pencil Decomposition
--------------------

For 3D distributed FFT, the engine uses pencil decomposition:

1. **Z-pencil**: Data distributed in Y and Z, FFT along X (r2c)
2. **Y-pencil**: Transpose Y<->Z, FFT along Y (c2c)
3. **X-pencil**: Transpose X<->Z, FFT along Z (c2c)

Transpose Operations
--------------------

The ``Transpose`` class handles MPI all-to-all communication for redistributing
data between pencil orientations.

**Current limitation**: Transpose operations assume AoS layout. When using GPU
memory, this may require a storage order conversion or an extension of the
``Transpose`` class to handle SoA.

Communication Efficiency
------------------------

**Design Decision**: The FFT module is designed so that MPI communication does
not require explicit packing and unpacking. Data is laid out so that
``MPI_Alltoall`` can operate directly on contiguous memory regions.

Forward 2D FFT Algorithm
========================

.. code-block:: text

   1. r2c FFT along X for each component
      - Input: real field (with ghosts)
      - Output: work buffer (half-complex, no ghosts)
      - Loop over components, batched over Y rows

   2. [MPI only] Transpose X<->Y
      - Redistributes data across ranks
      - Changes from Y-local to X-local

   3. c2c FFT along Y for each component
      - In-place on work buffer (serial) or output (MPI)
      - Batched over X values

   4. [Serial only] Copy work to output
      - Same storage order, direct copy

Forward 3D FFT Algorithm
========================

.. code-block:: text

   1. r2c FFT along X for each component
      - Input: real field
      - Output: Z-pencil work buffer (half-complex in X)
      - Batched over Y rows for each Z plane

   2a. [MPI] Transpose Y<->Z
       - Z-pencil to Y-pencil

   2b. c2c FFT along Y for each component
       - On Y-pencil buffer
       - Individual transforms per (X, Z) position

   2c. [MPI] Transpose Z<->Y
       - Y-pencil back to Z-pencil

   3. [MPI] Transpose X<->Z (or copy)
      - Z-pencil to X-pencil (output)

   4. c2c FFT along Z for each component
      - On output buffer
      - Individual transforms per (X, Y) position

   5. [Serial only] Copy work to output

Per-Component Looping
=====================

Instead of attempting to batch across components (which would require non-unit
strides on GPU), the FFT engine loops over components:

.. code-block:: cpp

   for (Index_t comp = 0; comp < nb_components; ++comp) {
     Index_t in_comp_offset = comp * in_comp_factor;
     Index_t out_comp_offset = comp * work_comp_factor;
     backend->r2c(Nx, batch_size,
                  input_ptr + in_base + in_comp_offset, in_x_stride, in_row_dist,
                  work_ptr + out_comp_offset, work_x_stride, work_row_dist);
   }

This approach:

- Satisfies cuFFT's unit-stride requirement for R2C/C2R
- Allows efficient batching within each component
- Works correctly for both SoA and AoS storage orders
- Has minimal overhead (one kernel launch per component)

Normalization
=============

Like FFTW and cuFFT, the transforms are **unnormalized**. A forward FFT
followed by an inverse FFT multiplies the result by N (the transform size).
Users must normalize explicitly if needed.
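The following snippet reproduces the same convention with plain numpy (it does
not call the *µ*\Grid FFT engine): an unnormalized forward transform followed
by an unnormalized inverse transform scales the data by N.

.. code-block:: python

   import numpy as np

   # Demonstrate the FFTW/cuFFT convention with numpy only: forward and inverse
   # transforms that both omit the 1/N factor scale the round trip by N.
   rng = np.random.default_rng(0)
   u = rng.random((8, 6))
   N = u.size

   u_hat = np.fft.rfftn(u)                                   # unnormalized forward r2c
   u_back = np.fft.irfftn(u_hat, s=u.shape, norm="forward")  # unnormalized inverse c2r

   assert np.allclose(u_back, N * u)    # round trip is scaled by N
   assert np.allclose(u_back / N, u)    # explicit user-side normalization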
FFT File Structure
==================

.. code-block:: text

   src/libmugrid/fft/
   ├── fft_engine.hh           # Template class FFTEngine
   ├── fft_engine_base.hh/cc   # Base class with field collections and transpose management
   ├── fft_1d_backend.hh       # Abstract interface for 1D FFT backends
   ├── pocketfft_backend.cc    # CPU backend using pocketfft
   ├── cufft_backend.cc        # NVIDIA GPU backend using cuFFT
   ├── rocfft_backend.cc       # AMD GPU backend using rocFFT
   └── transpose.hh/cc         # MPI transpose operations

Operator Kernel Design
**********************

This section describes the design of the discrete differential operators in
*µ*\Grid, including their common design principles for ghost region handling
and periodicity.

Operator Overview
=================

*µ*\Grid provides several discrete differential operators optimized for
structured grids:

.. list-table::
   :header-rows: 1
   :widths: 25 30 20 25

   * - Operator
     - Description
     - Stencil Size
     - Input -> Output
   * - ``GenericLinearOperator``
     - Flexible stencil-based convolution
     - Configurable
     - nodal -> quad or nodal -> nodal
   * - ``FEMGradientOperator``
     - FEM gradient (nodal -> quadrature)
     - 8 nodes (3D), 4 nodes (2D)
     - nodal -> quad
   * - ``LaplaceOperator``
     - 7-point (3D) or 5-point (2D) Laplacian
     - 7 (3D), 5 (2D)
     - nodal -> nodal
   * - ``IsotropicStiffnessOperator``
     - Fused elasticity kernel
     - 8 nodes (3D), 4 nodes (2D)
     - nodal -> nodal

Common Design Principles
========================

All operators follow these design principles for consistency and correctness:

Ghost Region Handling
---------------------

Operators make no assumptions about boundary conditions and operate on the
interior region only. Distributed-memory parallelization and periodic boundary
conditions are implemented through ghost cell communication (see the numpy
sketch at the end of this section):

- **Ghost regions** are filled by ``CartesianDecomposition::communicate_ghosts()``
  before the operator is called
- For **periodic BC**: ghost regions contain copies from the opposite boundary
- For **non-periodic BC** with Dirichlet conditions: boundary node values are
  constrained, and their forces are not computed by the fused kernels (they
  would be overwritten anyway)
- **Interior node computation only**: For all BCs, operators compute forces
  only for interior nodes (e.g. indices 1 to n-2, assuming a ghost buffer of
  width 1), not for boundary nodes (indices 0 and n-1)

Stencil Offset Convention
-------------------------

All operators use a stencil offset to define where the operator "centers":

.. code-block:: python

   # Example: FEMGradientOperator with offset [0, 0, 0]
   # Output element (0,0,0) uses input nodes at (0,0,0), (1,0,0), (0,1,0), etc.
   grad_op = muGrid.FEMGradientOperator(3, grid_spacing)

   # Example: GenericLinearOperator with custom offset
   conv_op = muGrid.GenericLinearOperator([0, 0, 0], coefficients)

The offset determines which element's quadrature points correspond to which
nodal region.

Upfront Validation
------------------

Ghost region requirements are validated **once** before kernel execution, not
during iteration:

- Operators check that sufficient ghost cells exist in the field collection
- Invalid configurations throw clear error messages
- No bounds checking inside the hot loop

Memory Layout
-------------

CPU operators use AoS and GPU operators use SoA memory layouts:

- **CPU**: DOF components contiguous ``[ux0, uy0, uz0, ux1, uy1, uz1, ...]``
- **GPU**: Spatial indices contiguous ``[ux0, ux1, ..., uy0, uy1, ..., uz0, uz1, ...]``

The kernel implementations may differ between CPU and GPU, since each is
optimized for its target architecture.
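For intuition, the following pure-numpy snippet mimics what ghost filling
provides in the periodic case. It is illustrative only; the actual ghost
exchange is performed by ``CartesianDecomposition::communicate_ghosts()`` over
MPI, not by ``np.pad``.

.. code-block:: python

   import numpy as np

   # A ghost layer of width 1 around the interior region; for periodic BC the
   # ghost entries are copies of the opposite boundary.
   interior = np.arange(12.0).reshape(3, 4)    # interior nodes of one rank
   padded = np.pad(interior, 1, mode="wrap")   # interior plus periodic ghost layer

   assert np.array_equal(padded[0, 1:-1], interior[-1])   # top ghost row = bottom interior row
   assert np.array_equal(padded[-1, 1:-1], interior[0])   # bottom ghost row = top interior row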
GenericLinearOperator Design
============================

Purpose
-------

Flexible stencil-based convolution operator that can represent any linear
differential operator as a sparse stencil.

Design
------

- Takes a list of coefficient arrays, one per output component per stencil point
- Stencil defined by relative offsets from the current position
- Supports both forward (apply) and adjoint (transpose) operations

Ghost Requirements
------------------

The forward and transpose operations have different ghost requirements because
the stencil direction is reversed:

**Forward (apply)**: reads at position ``p + s`` (stencil offset s)

- Left ghosts: ``max(-offset, 0)`` in each dimension
- Right ghosts: ``max(stencil_shape - 1 + offset, 0)`` in each dimension

**Transpose**: reads at position ``p - s`` (reversed stencil direction)

- Left ghosts: ``max(stencil_shape - 1 + offset, 0)`` in each dimension (swapped from apply's right)
- Right ghosts: ``max(-offset, 0)`` in each dimension (swapped from apply's left)

**Example**: For a stencil with offset ``[0, 0, 0]`` and shape ``[2, 2, 2]``:

- Apply: left=0, right=1 (reads ahead)
- Transpose: left=1, right=0 (gathers from behind)

Usage
-----

.. code-block:: python

   # Create from FEMGradientOperator coefficients
   grad_op = muGrid.FEMGradientOperator(3, grid_spacing)
   conv_op = muGrid.GenericLinearOperator([0, 0, 0], grad_op.coefficients)

   # Apply gradient
   conv_op.apply(nodal_field, quadrature_field)

   # Apply divergence (transpose)
   conv_op.transpose(quadrature_field, nodal_field, quad_weights)

FEMGradientOperator Design
==========================

Purpose
-------

Computes gradients at quadrature points from nodal displacements using linear
finite element (P1) shape functions.

Mathematical Formulation
------------------------

For linear FEM on structured grids:

- **2D**: Each pixel is decomposed into 2 triangles, with 2 quadrature points
- **3D**: Each voxel is decomposed into 5 tetrahedra (Kuhn triangulation), with 5 quadrature points

The gradient at quadrature point q is:

.. math::

   \left.\frac{\partial u}{\partial x_i}\right|_q = \frac{1}{h_i} \sum_I B_q[i, I] \, u_I

where :math:`B_q` contains the shape function gradients for the element
containing q.

Shape Function Gradients
------------------------

**2D (2 triangles per pixel):**

.. code-block:: text

   B_2D[triangle][dim][node]
   Triangle 0: nodes 0,1,2 (lower-left)
   Triangle 1: nodes 1,2,3 (upper-right)

**3D (5 tetrahedra per voxel):**

.. code-block:: text

   B_3D[tet][dim][node]
   Tet 0:    central tetrahedron (weight 1/3)
   Tets 1-4: corner tetrahedra (weight 1/6 each)

Ghost Requirements
------------------

**Forward (``apply``): gradient, nodal -> quadrature**

- Input (nodal field): 1 ghost cell on the **right** in each dimension
- Output (quadrature field): no ghosts needed
- Iteration: over elements [0, n-2], reads nodal values at element corners
  [ix, ix+1] x [iy, iy+1] x [iz, iz+1]

**Transpose (``transpose``): divergence, quadrature -> nodal**

- Input (quadrature/stress field): 1 ghost cell on the **left** in each
  dimension (for periodic BC)
- Output (nodal field): no specific ghost requirement
- Iteration: over elements [0, n-2], accumulates contributions to nodal values
  at element corners

The ghost requirement swap occurs because:

- Forward reads "ahead" (node ix+1 when at element ix) -> needs right ghosts
- Transpose gathers from elements "behind" (element ix-1 contributes to node ix)
  -> needs left ghosts for periodic wraparound
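The following illustrative helper (not part of the *µ*\Grid API) reproduces the
apply/transpose ghost formulas from the ``GenericLinearOperator`` section above
and makes the left/right swap explicit.

.. code-block:: python

   import numpy as np

   def ghost_requirements(offset, stencil_shape):
       """Left/right ghost widths per dimension for apply and transpose."""
       offset = np.asarray(offset)
       shape = np.asarray(stencil_shape)
       apply_left = np.maximum(-offset, 0)
       apply_right = np.maximum(shape - 1 + offset, 0)
       # transpose swaps left and right because the stencil direction is reversed
       return {"apply": (apply_left, apply_right),
               "transpose": (apply_right, apply_left)}

   # Offset [0, 0, 0] with a 2x2x2 stencil: apply needs one right ghost,
   # transpose needs one left ghost, as in the example above.
   reqs = ghost_requirements([0, 0, 0], [2, 2, 2])
   assert (reqs["apply"][0] == 0).all() and (reqs["apply"][1] == 1).all()
   assert (reqs["transpose"][0] == 1).all() and (reqs["transpose"][1] == 0).all()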
LaplaceOperator Design
======================

Purpose
-------

Optimized discrete Laplacian for Poisson-type problems.

Mathematical Formulation
------------------------

Standard finite difference Laplacian:

.. math::

   \nabla^2 u \approx \frac{u[i+1] + u[i-1] - 2u[i]}{h^2} + \ldots

**2D (5-point stencil):**

.. code-block:: text

       1
   1  -4   1
       1

**3D (7-point stencil):**

.. code-block:: text

   Center: -6, Neighbors: +1 (x6)

Ghost Requirements
------------------

- 1 ghost cell in each direction (for accessing neighbors)

Properties
----------

- Self-adjoint: transpose equals forward apply
- Negative semi-definite (positive semi-definite with a negative ``scale``)
- Configurable scale factor for grid spacing and sign conventions

Usage
-----

.. code-block:: python

   # Create Laplacian with scale = -1/h^2 for positive-definite form
   laplace = muGrid.LaplaceOperator3D(scale=-1.0 / h**2)
   laplace.apply(u, laplacian_u)

IsotropicStiffnessOperator Design
=================================

Purpose
-------

Fused kernel computing ``force = K @ displacement`` for isotropic linear
elastic materials, avoiding explicit assembly of the stiffness matrix K.

Mathematical Formulation
------------------------

For isotropic materials, the element stiffness matrix decomposes as:

.. math::

   K_e = 2\mu G + \lambda V

where:

- :math:`G = \sum_q w_q B_q^T I' B_q` - geometry matrix (shear stiffness)
- :math:`V = \sum_q w_q (B_q^T m)(m^T B_q)` - volumetric coupling matrix
- :math:`I' = \text{diag}(1, 1, 1, 0.5, 0.5, 0.5)` - Voigt scaling for strain energy
- :math:`m = [1, 1, 1, 0, 0, 0]^T` - trace selector vector

This reduces memory from :math:`O(N \times \mathrm{DOF}^2)` for the full K to
:math:`O(N \times 2)` (two Lamé parameters per voxel) for spatially varying
materials.

Element Decomposition
---------------------

**2D:** 2 triangles per pixel

- Quadrature weights: ``[0.5, 0.5]``

**3D:** 5 tetrahedra per voxel (Kuhn triangulation)

- Quadrature weights: ``[1/3, 1/6, 1/6, 1/6, 1/6]``
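A small numpy sketch of this decomposition illustrates how the fused kernel
assembles the element stiffness on the fly. The :math:`B_q` matrices below are
random placeholders, not the actual shape function gradients.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   ndof = 24                                      # 8 nodes x 3 DOFs per node (3D)
   weights = [1 / 3, 1 / 6, 1 / 6, 1 / 6, 1 / 6]  # Kuhn tetrahedra quadrature weights
   I_prime = np.diag([1, 1, 1, 0.5, 0.5, 0.5])    # Voigt scaling
   m = np.array([1., 1., 1., 0., 0., 0.])         # trace selector

   # Placeholder B_q matrices (6 x 24); stand-ins used only to exercise the algebra.
   B = [rng.random((6, ndof)) for _ in weights]

   G = sum(w * B_q.T @ I_prime @ B_q for w, B_q in zip(weights, B))
   V = sum(w * np.outer(B_q.T @ m, B_q.T @ m) for w, B_q in zip(weights, B))

   lam, mu = 1.25, 0.8                            # example Lamé parameters
   K_e = 2 * mu * G + lam * V                     # element stiffness, never stored per voxel
   assert np.allclose(K_e, K_e.T)                 # symmetric, as expected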
Field Requirements
------------------

All fields use **node-based indexing** with the same grid dimensions:

.. list-table::
   :header-rows: 1
   :widths: 30 25 22 23

   * - Field
     - Dimensions
     - Left Ghosts
     - Right Ghosts
   * - Displacement
     - (nx, ny, nz)
     - 1
     - 1
   * - Force
     - (nx, ny, nz)
     - 1
     - 1
   * - Material (lam, mu)
     - (nx, ny, nz)
     - 1
     - 1

**Key design principle**: The kernel does not distinguish between periodic and
non-periodic boundary conditions. Ghost cell communication (via
``CartesianDecomposition::communicate_ghosts()``) handles periodicity:

- For **periodic BC**: ghost cells contain copies from the opposite boundary
- For **non-periodic BC**: ghost cells are filled with appropriate boundary values

This unified approach simplifies the kernel implementation and enables
consistent behavior across CPU and GPU.

Gather Pattern
--------------

The kernel iterates over **interior nodes** and gathers contributions from all
neighboring elements:

**2D (4 elements per node):**

.. code-block:: text

   Element offsets and local node indices:
   (-1,-1) -> local node 3
   ( 0,-1) -> local node 2
   (-1, 0) -> local node 1
   ( 0, 0) -> local node 0

**3D (8 elements per node):**

.. code-block:: text

   Element offsets and local node indices:
   (-1,-1,-1) -> local node 7
   ( 0,-1,-1) -> local node 6
   (-1, 0,-1) -> local node 5
   ( 0, 0,-1) -> local node 4
   (-1,-1, 0) -> local node 3
   ( 0,-1, 0) -> local node 2
   (-1, 0, 0) -> local node 1
   ( 0, 0, 0) -> local node 0

Iteration Bounds
----------------

The kernel iterates over all interior nodes (indices 0 to n-1):

.. code-block:: cpp

   // Iterate over all nodes - ghost communication handles periodicity
   for (Index_t ix = 0; ix < nnx; ++ix) {
     for (Index_t iy = 0; iy < nny; ++iy) {
       // Compute force at node (ix, iy)
     }
   }

The kernel reads from ghost cells at positions -1 (left) and n (right), which
must be populated before calling the operator. For periodic BC, these contain
copies from the opposite boundary. For non-periodic BC, they contain boundary
values (typically zero for homogeneous Dirichlet conditions).

Kernel Pseudocode
-----------------

.. code-block:: text

   for each node (ix, iy, iz) in [0, nnx) x [0, nny) x [0, nnz):
       f = [0, 0, 0]  // force accumulator

       for each neighboring element (8 in 3D):
           // Element indices can be -1 (left ghost) or nx (right ghost)
           ex, ey, ez = element indices relative to node
           local_node = which corner of the element this node is

           // Material fields are node-indexed, read from the element position
           // (element at ex,ey,ez has its material stored at node ex,ey,ez)
           lam = material_lambda[ex, ey, ez]
           mu  = material_mu[ex, ey, ez]

           // Gather displacement from all element nodes (may read ghosts)
           u = [displacement at each element corner node]

           // Compute stiffness contribution
           for each DOF d:
               row = local_node * NB_DOFS + d
               f[d] += sum_j (2*mu * G[row,j] + lam * V[row,j]) * u[j]

       force[ix, iy, iz] = f

Memory Layout Details
=====================

CPU Layout (Array of Structures)
--------------------------------

DOF components are contiguous in memory:

.. code-block:: text

   Memory: [ux0, uy0, uz0, ux1, uy1, uz1, ux2, uy2, uz2, ...]

Strides:

.. code-block:: cpp

   disp_stride_d = 1;                 // DOF components contiguous
   disp_stride_x = NB_DOFS;           // = 3 for 3D
   disp_stride_y = NB_DOFS * nx;
   disp_stride_z = NB_DOFS * nx * ny;

GPU Layout (Structure of Arrays)
--------------------------------

Spatial indices are contiguous for coalesced memory access:

.. code-block:: text

   Memory: [ux0, ux1, ux2, ..., uy0, uy1, uy2, ..., uz0, uz1, uz2, ...]

Strides:

.. code-block:: cpp

   disp_stride_x = 1;                 // Spatial x contiguous
   disp_stride_y = nx;
   disp_stride_z = nx * ny;
   disp_stride_d = nx * ny * nz;      // DOF components separated

Data Pointer Offset
-------------------

Pointers are offset to account for ghost cells:

.. code-block:: cpp

   const Real* disp_data = displacement.data() + ghost_offset;
   // Now index 0 = first interior node, index -1 = left ghost

Operator File Structure
=======================

.. code-block:: text

   src/libmugrid/operators/
   ├── linear.hh                 # Base LinearOperator class
   ├── generic.hh/.cc            # GenericLinearOperator
   ├── laplace_2d.hh/.cc         # LaplaceOperator2D
   ├── laplace_3d.hh/.cc         # LaplaceOperator3D
   ├── fem_gradient_2d.hh/.cc    # FEMGradientOperator2D
   ├── fem_gradient_3d.hh/.cc    # FEMGradientOperator3D
   └── solids/
       ├── isotropic_stiffness_2d.hh/.cc   # IsotropicStiffnessOperator2D
       ├── isotropic_stiffness_3d.hh/.cc   # IsotropicStiffnessOperator3D
       └── isotropic_stiffness_gpu.cc      # GPU kernels

Testing
=======

Tests are in ``tests/python_isotropic_stiffness_operator_tests.py``:

- ``test_compare_with_generic_operator``: Compares the fused kernel against an
  explicit B^T C B product
- ``test_unit_impulse_*``: Verifies the response to a unit displacement at
  specific nodes
- ``test_symmetry``: Verifies that K is symmetric
- ``ValidationGuardTest*``: Verifies proper error handling for invalid
  configurations
- ``GPUUnitImpulseTest``: Verifies that the GPU kernel matches the CPU kernel
  output

With node-based indexing, all tests compare the full output (all nodes), not
just the interior nodes.
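As a pure-numpy illustration of the kind of property these tests check (not
taken from the actual test suite), the periodic 7-point Laplacian is
self-adjoint, so :math:`\langle v, L u \rangle = \langle u, L v \rangle`:

.. code-block:: python

   import numpy as np

   def laplace_7pt(u):
       """Periodic 7-point Laplacian; np.roll stands in for ghost-cell wraparound."""
       return (np.roll(u, 1, 0) + np.roll(u, -1, 0)
               + np.roll(u, 1, 1) + np.roll(u, -1, 1)
               + np.roll(u, 1, 2) + np.roll(u, -1, 2) - 6 * u)

   rng = np.random.default_rng(42)
   u = rng.random((4, 5, 6))
   v = rng.random((4, 5, 6))

   # Self-adjointness: <v, L u> == <u, L v>
   assert np.isclose(np.vdot(v, laplace_7pt(u)), np.vdot(u, laplace_7pt(v)))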