Linear Algebra
##############

*µ*\Grid provides a set of linear algebra operations optimized for field
data. These operations work directly on muGrid fields, avoiding the overhead
of creating non-contiguous array views. The ``linalg`` module covers the bulk
of Level 1 BLAS functionality and provides the fundamental building blocks
for iterative solvers and other numerical algorithms.

Overview
********

The linear algebra operations are designed around two key principles:

1. **MPI correctness**: Operations that compute scalar results (such as dot
   products and norms) iterate only over interior pixels, excluding ghost
   regions. This prevents double-counting of values that are duplicated
   across MPI ranks.

2. **Performance**: Operations that modify field data (such as ``axpy`` and
   ``scal``) operate on the full buffer, including ghost regions. This is
   efficient because the memory is contiguous, and ghost values are typically
   overwritten by subsequent ghost communication anyway.

Available Operations
********************

The following operations are available in the ``muGrid::linalg`` namespace
(C++) and as functions in the ``muGrid.linalg`` module (Python):

.. list-table:: Linear Algebra Operations
   :header-rows: 1
   :widths: 25 35 40

   * - Operation
     - Description
     - BLAS Equivalent
   * - ``vecdot(a, b)``
     - Vector dot product: :math:`\sum_i a_i b_i`
     - SDOT / DDOT
   * - ``axpy(alpha, x, y)``
     - :math:`y \leftarrow \alpha x + y`
     - SAXPY / DAXPY
   * - ``scal(alpha, x)``
     - :math:`x \leftarrow \alpha x`
     - SSCAL / DSCAL
   * - ``axpby(alpha, x, beta, y)``
     - :math:`y \leftarrow \alpha x + \beta y`
     - (extended BLAS)
   * - ``copy(src, dst)``
     - :math:`\text{dst} \leftarrow \text{src}`
     - SCOPY / DCOPY
   * - ``norm_sq(x)``
     - Squared norm: :math:`\sum_i x_i^2`
     - SNRM2 / DNRM2 (squared)
   * - ``axpy_norm_sq(alpha, x, y)``
     - Fused: :math:`y \leftarrow \alpha x + y`, returns :math:`\|y\|^2`
     - (fused operation)

Ghost Region Handling
*********************

Understanding how ghost regions are handled is important for MPI-parallel
codes:

**Scalar-producing operations** (``vecdot``, ``norm_sq``, ``axpy_norm_sq``):
These iterate only over interior pixels. Ghost values are excluded because
they are duplicates of values owned by neighboring MPI ranks. The returned
scalar is a local result that must be MPI-reduced if a global result is
needed.

**Field-modifying operations** (``axpy``, ``scal``, ``axpby``, ``copy``):
These operate on the full buffer, including ghost regions. This is more
efficient than iterating only over interior pixels, and ghost values are
overwritten by subsequent ``communicate_ghosts()`` calls anyway.

Usage Examples
**************

C++ Usage
---------

.. code-block:: cpp

   #include "linalg/linalg.hh"

   using namespace muGrid;
   using namespace muGrid::linalg;

   // Assuming a, b, x, y, src, dst are real-valued typed fields
   // and alpha is a Real

   // Dot product (interior only)
   Real dot = vecdot(a, b);

   // AXPY: y = 0.5 * x + y
   axpy(0.5, x, y);

   // Scale: x = 2.0 * x
   scal(2.0, x);

   // Combined scale and add: y = alpha * x + beta * y
   axpby(1.0, x, -1.0, y);  // y = x - y

   // Copy: dst = src
   copy(src, dst);

   // Squared norm (interior only)
   Real norm2 = norm_sq(x);

   // Fused AXPY + norm: y = alpha * x + y, returns ||y||²
   Real new_norm2 = axpy_norm_sq(alpha, x, y);

Python Usage
------------

The linear algebra operations are available through the ``muGrid.linalg``
module:

.. code-block:: python

   import muGrid
   import muGrid.linalg as la

   # Create fields
   fc = muGrid.GlobalFieldCollection([64, 64])
   x = fc.real_field("x")
   y = fc.real_field("y")

   # Initialize fields
   x.p[...] = 1.0
   y.p[...] = 2.0

   # Dot product (returns local scalar, not MPI-reduced)
   dot = la.vecdot(x, y)

   # AXPY: y = 0.5 * x + y
   la.axpy(0.5, x, y)

   # Scale: x = 2.0 * x
   la.scal(2.0, x)

   # Squared norm
   norm2 = la.norm_sq(x)
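The remaining operations from the table are likewise available from Python.
The sketch below continues the example above and assumes that ``axpby``,
``copy``, and ``axpy_norm_sq`` take their arguments in the same order as
shown in the operation table:

.. code-block:: python

   # Combined scale and add: y = 1.0 * x + (-1.0) * y, i.e. y = x - y
   la.axpby(1.0, x, -1.0, y)

   # Copy: y <- x (argument order is copy(src, dst))
   la.copy(x, y)

   # Fused AXPY + squared norm: y = 0.5 * x + y, returns the local ||y||²
   norm2 = la.axpy_norm_sq(0.5, x, y)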
MPI-Parallel Example
--------------------

For MPI-parallel computations, scalar results must be reduced across ranks:

.. code-block:: python

   from mpi4py import MPI

   import muGrid
   import muGrid.linalg as la

   # Create parallel decomposition
   comm = muGrid.Communicator(MPI.COMM_WORLD)
   decomp = muGrid.CartesianDecomposition(
       comm,
       nb_domain_grid_pts=[128, 128],
       nb_ghosts_left=[1, 1],
       nb_ghosts_right=[1, 1],
   )
   x = decomp.real_field("x")
   y = decomp.real_field("y")

   # ... initialize fields ...

   # Local dot product
   local_dot = la.vecdot(x, y)

   # MPI reduce to get the global dot product
   global_dot = MPI.COMM_WORLD.allreduce(local_dot, op=MPI.SUM)

Fused Operations
****************

The ``axpy_norm_sq`` function provides a fused operation that computes both
an AXPY update and the squared norm of the result in a single pass through
memory. This is more efficient than separate ``axpy`` and ``norm_sq`` calls:

.. code-block:: python

   # Separate operations (less efficient):
   # - 1 read of x, 2 reads of y, 1 write of y
   la.axpy(alpha, x, y)
   norm2 = la.norm_sq(y)

   # Fused operation (more efficient):
   # - 1 read of x, 1 read of y, 1 write of y
   norm2 = la.axpy_norm_sq(alpha, x, y)

This optimization is particularly valuable in iterative solvers, where memory
bandwidth is often the limiting factor.

GPU Support
***********

All linear algebra operations support GPU fields (CUDA and ROCm). The
operations automatically use GPU-optimized kernels when the input fields
reside in device memory:

.. code-block:: python

   import muGrid
   import muGrid.linalg as la

   if muGrid.has_gpu:
       # Create a GPU field collection
       fc = muGrid.GlobalFieldCollection([64, 64], device="gpu")
       x = fc.real_field("x")
       y = fc.real_field("y")

       # Operations execute on the GPU
       la.axpy(0.5, x, y)     # GPU kernel
       dot = la.vecdot(x, y)  # GPU reduction

The GPU kernels are optimized for coalesced memory access and use efficient
parallel reduction algorithms for the scalar-producing operations.

Integration with Solvers
************************

The linear algebra module provides the core operations used by the iterative
solvers in ``muGrid.Solvers``. For example, the conjugate gradient solver
uses these operations internally:

.. code-block:: python

   from muGrid.Solvers import conjugate_gradients

   # The CG solver uses vecdot, axpy, scal, etc. internally
   x = conjugate_gradients(comm, fc, b, x0, hessp=apply_operator)

If you are implementing custom solvers, using the ``muGrid.linalg``
operations ensures efficient execution on both CPU and GPU, with proper
handling of ghost regions for MPI parallelism.
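To illustrate how these primitives compose in a custom solver, the following
is a minimal, unpreconditioned conjugate-gradient sketch written only in
terms of the operations documented above. The operator callback
``apply_operator``, the work-field names, and the ``mpi_comm`` parameter (an
``mpi4py`` communicator used for the reductions, as in the MPI-parallel
example) are illustrative assumptions, not part of the µGrid API:

.. code-block:: python

   from mpi4py import MPI

   import muGrid.linalg as la

   def cg_sketch(mpi_comm, fc, apply_operator, b, x, tol=1e-8, maxiter=1000):
       """Minimal unpreconditioned CG sketch built on muGrid.linalg.

       ``apply_operator(p, Ap)`` is assumed to write the operator applied to
       ``p`` into ``Ap`` and to perform any required ghost communication.
       """
       # Hypothetical work fields
       r = fc.real_field("cg_residual")
       p = fc.real_field("cg_direction")
       Ap = fc.real_field("cg_Ap")

       # r = b - A x, p = r
       apply_operator(x, Ap)
       la.copy(b, r)
       la.axpy(-1.0, Ap, r)
       la.copy(r, p)

       # Global squared residual norm (local results must be MPI-reduced)
       rr = mpi_comm.allreduce(la.norm_sq(r), op=MPI.SUM)

       for _ in range(maxiter):
           if rr < tol * tol:
               break
           apply_operator(p, Ap)
           pAp = mpi_comm.allreduce(la.vecdot(p, Ap), op=MPI.SUM)
           alpha = rr / pAp

           # x += alpha * p
           la.axpy(alpha, p, x)
           # Fused update: r -= alpha * Ap, returning the local ||r||²
           rr_new = mpi_comm.allreduce(
               la.axpy_norm_sq(-alpha, Ap, r), op=MPI.SUM)

           # p = r + (rr_new / rr) * p
           la.axpby(1.0, r, rr_new / rr, p)
           rr = rr_new

       return x

The same pattern carries over to other Krylov-type methods; only the
recurrences between the work fields change.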