
ADAM FNL Library — src/lib/fnl

The FNL backend is ADAM's OpenACC GPU acceleration layer, built on top of the FUNDAL GPU memory management library. It follows a consistent two-tier pattern for every subsystem: a wrapper object that extends the corresponding common CPU class and manages device-resident arrays, paired with a kernel module that contains OpenACC-decorated device subroutines implementing the actual computation.

No physics or algorithmic logic is duplicated — all equations, coefficients, and data structures are defined once in src/lib/common and mirrored to the GPU by the FNL layer.

The aggregate entry point adam_fnl_library re-exports the entire FNL API together with adam_common_library; a single use adam_fnl_library statement in application code exposes both layers.




Design and memory model

CPU vs GPU array layout

The common library stores field data in Fortran column-major order with variables as the fastest-varying index:

CPU:  q_cpu(nv, ni, nj, nk, nb)   — stride-1 on nv

On the GPU the block index is moved to the front so that threads mapped to adjacent blocks access adjacent memory locations, improving warp-level coalescing:

GPU:  q_gpu(nb, ni, nj, nk, nv)   — blocks contiguous

The transposition between layouts is performed by copy_transpose_cpu_gpu / copy_transpose_gpu_cpu in adam_fnl_field_object and their device kernels in adam_fnl_field_kernels. This copy happens only at I/O and AMR update boundaries; during time integration the GPU layout is used exclusively.
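The index mapping behind the two copy routines can be illustrated with a plain loop nest. The sketch below is hypothetical: the module and routine names are not ADAM's, ghost-cell offsets are ignored, and the arrays are assumed to arrive with 1-based bounds. When compiled with OpenACC, the directives offload the loops and data movement is left to the compiler's implicit clauses. Without OpenACC, the directives are ordinary comments and the code runs on the CPU.

```fortran
module layout_transpose_sketch
  ! Hypothetical sketch of the CPU <-> GPU layout change:
  !   CPU layout: q_cpu(nv, ni, nj, nk, nb), stride-1 on the variable index
  !   GPU layout: q_gpu(nb, ni, nj, nk, nv), stride-1 on the block index
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  subroutine transpose_cpu_to_gpu(q_cpu, q_gpu)
    real(rp), intent(in)  :: q_cpu(:, :, :, :, :)   ! (nv, ni, nj, nk, nb)
    real(rp), intent(out) :: q_gpu(:, :, :, :, :)   ! (nb, ni, nj, nk, nv)
    integer :: b, i, j, k, v
    ! b is the innermost loop, so consecutive iterations write
    ! consecutive addresses of q_gpu.
    !$acc parallel loop collapse(5) independent
    do v = 1, size(q_cpu, 1)
      do k = 1, size(q_cpu, 4)
        do j = 1, size(q_cpu, 3)
          do i = 1, size(q_cpu, 2)
            do b = 1, size(q_cpu, 5)
              q_gpu(b, i, j, k, v) = q_cpu(v, i, j, k, b)
            end do
          end do
        end do
      end do
    end do
  end subroutine transpose_cpu_to_gpu

  subroutine transpose_gpu_to_cpu(q_gpu, q_cpu)
    real(rp), intent(in)  :: q_gpu(:, :, :, :, :)   ! (nb, ni, nj, nk, nv)
    real(rp), intent(out) :: q_cpu(:, :, :, :, :)   ! (nv, ni, nj, nk, nb)
    integer :: b, i, j, k, v
    ! v is innermost here so that the writes to q_cpu are stride-1.
    !$acc parallel loop collapse(5) independent
    do b = 1, size(q_gpu, 1)
      do k = 1, size(q_gpu, 4)
        do j = 1, size(q_gpu, 3)
          do i = 1, size(q_gpu, 2)
            do v = 1, size(q_gpu, 5)
              q_cpu(v, i, j, k, b) = q_gpu(b, i, j, k, v)
            end do
          end do
        end do
      end do
    end do
  end subroutine transpose_gpu_to_cpu
end module layout_transpose_sketch
```

Each routine keeps the stride-1 index of the array it writes innermost. This is the same coalescing argument that motivates the GPU layout.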

FUNDAL integration

GPU memory allocation and host-device transfers route through FUNDAL utilities:

| Utility | Purpose |
|---|---|
| dev_alloc | Allocate device array |
| dev_assign_to_device | Copy host array to device |
| dev_memcpy_to_device | Raw device-to-device copy |
| DEVICEVAR(array) macro | Mark array as already device-resident (suppresses implicit copy) |

The DEVICEVAR macro is defined in fundal.H and appears at the top of every kernel file. It is the primary mechanism for telling the compiler that an array already resides on the device, so that no implicit host-device copy is generated for it.
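FUNDAL's own routine signatures are not reproduced here. The hypothetical sketch below is a rough analogue of the same workflow, written with standard OpenACC directives only. It allocates a device copy once, transfers it once, and then runs two kernels that use the present clause. present plays a role comparable to the one described for DEVICEVAR: it tells the compiler that the array is already on the device, so no implicit copy is generated. How DEVICEVAR actually expands is not shown in this document and is not assumed here.

```fortran
module resident_data_sketch
  ! Hypothetical sketch: plain OpenACC directives reproducing the
  ! "allocate once, copy once, reuse on device" pattern. FUNDAL routines
  ! and the DEVICEVAR macro are not called or imitated literally.
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  subroutine scale_on_device(q, factor)
    real(rp), intent(inout) :: q(:)
    real(rp), intent(in)    :: factor
    integer :: n
    ! present(q) asserts that q is already on the device, so no implicit
    ! host-device copy is generated for it.
    !$acc parallel loop independent present(q)
    do n = 1, size(q)
      q(n) = factor * q(n)
    end do
  end subroutine scale_on_device

  subroutine run_resident(q)
    real(rp), intent(inout) :: q(:)
    ! Device allocation (the role dev_alloc plays in FUNDAL).
    !$acc enter data create(q)
    ! Host-to-device copy (the role dev_assign_to_device plays).
    !$acc update device(q)
    call scale_on_device(q, 2.0_rp)
    ! The second kernel reuses the resident copy; nothing is transferred.
    call scale_on_device(q, 3.0_rp)
    ! Copy back only when the host needs the data (e.g. for I/O).
    !$acc update host(q)
    !$acc exit data delete(q)
  end subroutine run_resident
end module resident_data_sketch
```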

IB solid masking

Every kernel that modifies field variables carries the same solid-cell guard: cells inside a solid body (phi_gpu(b,i,j,k,all_solids) > 0) are skipped. The test sits at the innermost loop level. In fluid-only regions every thread takes the same branch, so the guard causes no warp divergence there.
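The following is a minimal sketch of where the guard sits in a kernel. It assumes the convention φ > 0 inside a solid, which is the convention under which the max taken by compute_phi_all_solids_dev forms the union of the solids. It also assumes that the last slice of phi is all_solids = n_solids + 1. The routine name and the increment it applies are hypothetical.

```fortran
module ib_mask_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  subroutine add_increment_fluid_only(q, dq, phi, dt)
    ! Hypothetical kernel: q += dt*dq in fluid cells only.
    ! q, dq: (nb, ni, nj, nk, nv); phi: (nb, ni, nj, nk, n_solids+1)
    real(rp), intent(inout) :: q(:, :, :, :, :)
    real(rp), intent(in)    :: dq(:, :, :, :, :)
    real(rp), intent(in)    :: phi(:, :, :, :, :)
    real(rp), intent(in)    :: dt
    integer :: b, i, j, k, v, all_solids
    all_solids = size(phi, 5)   ! last slice: union of all solids
    !$acc parallel loop collapse(4) independent
    do k = 1, size(q, 4)
      do j = 1, size(q, 3)
        do i = 1, size(q, 2)
          do b = 1, size(q, 1)
            ! Assumed convention: phi > 0 inside a solid, so update only phi <= 0.
            if (phi(b, i, j, k, all_solids) <= 0.0_rp) then
              do v = 1, size(q, 5)
                q(b, i, j, k, v) = q(b, i, j, k, v) + dt * dq(b, i, j, k, v)
              end do
            end if
          end do
        end do
      end do
    end do
  end subroutine add_increment_fluid_only
end module ib_mask_sketch
```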


Field

adam_fnl_field_object — GPU field wrapper

Extends field_object (common). Holds all device-side arrays needed for field operations and provides the host-device transfer interface.

Device arrays:

| Array | Shape | Purpose |
|---|---|---|
| q_gpu | (nb, ni, nj, nk, nv) | Primary field: conservative variables |
| q_t_gpu | (nv, ni, nj, nk, nb) | Transposed scratch, used during CPU↔GPU copies |
| x_cell_gpu, y_cell_gpu, z_cell_gpu | (nb, ni, nj, nk) | Cell centroid coordinates |
| dxyz_gpu | (nb, 3) | Block mesh spacing (dx, dy, dz) |
| fec_1_6_array_gpu | (nb, 26) | Face enumeration codes for IB ghost-cell lookup |

Key methods:

| Method | Purpose |
|---|---|
| initialize | Allocate all device arrays via dev_alloc |
| copy_cpu_gpu | Transfer q, coordinate arrays, and maps to device |
| copy_transpose_cpu_gpu(nv, q_cpu, q_gpu) | Transpose and copy q_cpu(nv,…,nb) → q_gpu(nb,…,nv) |
| copy_transpose_gpu_cpu(nv, q_gpu, q_cpu) | Inverse transpose: q_gpu(nb,…,nv) → q_cpu(nv,…,nb) |
| update_ghost_local_gpu | Apply intra-rank ghost-cell updates entirely on device |
| update_ghost_mpi_gpu | Pack send buffer on device, perform MPI exchange, unpack on device |
| compute_q_gradient(b, ivar, q_gpu, gradient) | AMR refinement criterion: maximum gradient magnitude, max ‖∇q‖, of variable ivar over block b |

adam_fnl_field_kernels — field device kernels

All routines carry !$acc parallel loop independent and use DEVICEVAR on every device pointer argument.

| Kernel | Purpose |
|---|---|
| compute_q_gradient_dev | Centred-difference gradient magnitude with reduction(max:) |
| compute_normL2_residuals_dev | L2 norm √(Σ dq²) per variable with reduction(+:) |
| copy_transpose_gpu_cpu_dev | Transpose (nb,ni,nj,nk,nv) → (nv,ni,nj,nk,nb) on device |
| populate_send_buffer_ghost_gpu_dev | Pack ghost-cell values into MPI send buffer; supports one-to-one copies and 8-cell AMR averaging |
| receive_recv_buffer_ghost_gpu_dev | Unpack MPI receive buffer into ghost cells |
| update_ghost_local_gpu_dev | Apply intra-rank block-to-block ghost updates; supports AMR coarse↔fine averaging |
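The two reduction patterns used by compute_q_gradient_dev and compute_normL2_residuals_dev can be sketched as follows. This is an illustration under stated assumptions, not the ADAM source. It works on one variable of one block and uses a second-order centred difference on interior cells, assuming one ghost layer on each side. It returns an unnormalised L2 norm. All names are hypothetical.

```fortran
module reduction_sketch
  ! Hypothetical sketch: max-reduction of the centred-difference gradient
  ! magnitude, and sum-reduction for an (unnormalised) L2 norm.
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  function max_gradient_magnitude(q, dxyz) result(gmax)
    ! q(i, j, k): one variable on one block, one ghost layer on each side
    real(rp), intent(in) :: q(:, :, :)
    real(rp), intent(in) :: dxyz(3)          ! (dx, dy, dz)
    real(rp) :: gmax, g, gx, gy, gz
    integer :: i, j, k
    g = 0.0_rp
    !$acc parallel loop collapse(3) reduction(max:g) private(gx, gy, gz)
    do k = 2, size(q, 3) - 1
      do j = 2, size(q, 2) - 1
        do i = 2, size(q, 1) - 1
          gx = (q(i+1, j, k) - q(i-1, j, k)) / (2.0_rp * dxyz(1))
          gy = (q(i, j+1, k) - q(i, j-1, k)) / (2.0_rp * dxyz(2))
          gz = (q(i, j, k+1) - q(i, j, k-1)) / (2.0_rp * dxyz(3))
          g = max(g, sqrt(gx**2 + gy**2 + gz**2))
        end do
      end do
    end do
    gmax = g
  end function max_gradient_magnitude

  function l2_norm(dq) result(norm)
    real(rp), intent(in) :: dq(:, :, :)
    real(rp) :: norm, s
    integer :: i, j, k
    s = 0.0_rp
    !$acc parallel loop collapse(3) reduction(+:s)
    do k = 1, size(dq, 3)
      do j = 1, size(dq, 2)
        do i = 1, size(dq, 1)
          s = s + dq(i, j, k)**2
        end do
      end do
    end do
    norm = sqrt(s)
  end function l2_norm
end module reduction_sketch
```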

WENO reconstruction

adam_fnl_weno_object — GPU WENO coefficient wrapper

Extends weno_object (common). The CPU object computes all WENO coefficients once during initialisation; the FNL object mirrors them to device memory and holds the ROR (Reduced-Order Reconstruction) tables used near solid boundaries.

Device arrays:

| Array | Shape | Purpose |
|---|---|---|
| a_gpu | (2, 0:S-1, S) | Optimal WENO weights per sub-stencil and face |
| p_gpu | (2, 0:S-1, 0:S-1, S) | Polynomial reconstruction coefficients |
| d_gpu | (0:S-1, 0:S-1, 0:S-1, S) | Smoothness indicator coefficients |
| ror_schemes_gpu | (:) | ROR fallback scheme orders near solid walls |
| ror_ivar_gpu | (:) | Variable indices checked by ROR |
| cell_scheme_gpu | (nb, ni, nj, nk) | Per-cell effective reconstruction order |

adam_fnl_weno_kernels — WENO device kernels

| Kernel | Directive | Purpose |
|---|---|---|
| weno_reconstruct_upwind_dev(S, a, p, d, zeps, V, VR) | !$acc routine seq | Reconstruct left (VR(1)) and right (VR(2)) interface values from stencil V |

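The reconstruction follows the standard three-step WENO algorithm:

  1. Compute S polynomial reconstructions VP(f,k) from overlapping sub-stencils (coefficients p).
  2. Compute a smoothness indicator for each sub-stencil as a quadratic form in the stencil values (coefficients d). This measures the integrated squared derivatives of the sub-stencil polynomial.
  3. Form nonlinear weights w(f,k) from the optimal weights a and the smoothness indicators (zeps guards against division by zero), then combine: VR(f) = Σ_k w(f,k) · VP(f,k).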

The !$acc routine seq directive makes the procedure callable from inside a parallel region without launching a new kernel. It runs sequentially within the calling thread, one stencil per thread, inside the outer loop over cells.
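To make the three steps concrete, here is a hypothetical sketch for the smallest case, S = 2 (third-order WENO). It uses the classical Jiang–Shu coefficients and nonlinear weights α_k = a_k / (zeps + β_k)². ADAM reads its coefficients from a_gpu, p_gpu and d_gpu for general S. The hard-coded values, the exponent 2, and the output ordering (VR(1) at face i−1/2, VR(2) at face i+1/2) are assumptions of this sketch.

```fortran
module weno3_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  pure subroutine weno3_reconstruct(v, zeps, vr)
    !$acc routine seq
    ! Hypothetical WENO3 (S = 2) with Jiang-Shu weights.
    ! v(1:3): cell values at i-1, i, i+1
    ! vr(1): value at face i-1/2, vr(2): value at face i+1/2 (assumed ordering)
    real(rp), intent(in)  :: v(3), zeps
    real(rp), intent(out) :: vr(2)
    real(rp) :: beta(2), vp(2), alpha(2), dopt(2)
    ! Step 2: smoothness indicators of the two sub-stencils.
    beta(1) = (v(2) - v(1))**2     ! sub-stencil {i-1, i}
    beta(2) = (v(3) - v(2))**2     ! sub-stencil {i, i+1}

    ! Face i+1/2. Step 1: candidate polynomials; step 3: weights and combination.
    vp(1) = -0.5_rp * v(1) + 1.5_rp * v(2)
    vp(2) =  0.5_rp * v(2) + 0.5_rp * v(3)
    dopt = [1.0_rp / 3.0_rp, 2.0_rp / 3.0_rp]
    alpha = dopt / (zeps + beta)**2
    vr(2) = sum(alpha * vp) / sum(alpha)

    ! Face i-1/2: mirror image of the above.
    vp(1) =  0.5_rp * v(1) + 0.5_rp * v(2)
    vp(2) =  1.5_rp * v(2) - 0.5_rp * v(3)
    dopt = [2.0_rp / 3.0_rp, 1.0_rp / 3.0_rp]
    alpha = dopt / (zeps + beta)**2
    vr(1) = sum(alpha * vp) / sum(alpha)
  end subroutine weno3_reconstruct
end module weno3_sketch
```

For smooth (here linear) data, both sub-stencils agree and the optimal weights are recovered. Across a jump, the weight of the sub-stencil that straddles the discontinuity collapses, which keeps the reconstruction non-oscillatory.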


Runge-Kutta integration

adam_fnl_rk_object — GPU RK stage manager

Extends rk_object (common). Manages stage storage on device and drives the per-stage updates.

Device arrays:

| Array | Shape | Purpose |
|---|---|---|
| q_rk_gpu | (nb, ni, nj, nk, nv, nrk) | Stage values (1 stage for low-storage, nrk for SSP) |
| alph_gpu | (nrk, nrk) | SSP alpha coefficients |
| beta_gpu | (nrk) | SSP beta coefficients |
| gamm_gpu | (nrk) | SSP gamma coefficients |

Supported schemes:

| Scheme | Storage mode | Stage slots in q_rk_gpu |
|---|---|---|
| RK_1, RK_2, RK_3 | Low-storage | 1 (q_gpu is overwritten in place) |
| RK_SSP_22, RK_SSP_33 | Multi-stage | 2 / 3 |
| RK_SSP_54 | Multi-stage | 5 |

Key methods:

| Method | Purpose |
|---|---|
| initialize(rk, nb, ngc, ni, nj, nk, nv) | Allocate q_rk_gpu sized to scheme requirements |
| initialize_stages(q_gpu) | Broadcast q_gpu into all stage slots |
| assign_stage(s, q_gpu, phi_gpu) | Copy q_gpu into stage s, skipping solid cells |
| compute_stage(s, dt, phi_gpu) | Accumulate stages 1…s−1 into stage s (SSP) |
| compute_stage_ls(s, dt, phi_gpu, dq_gpu, q_gpu) | Low-storage update: q = ark·q_n + brk·q + dt·crk·dq |
| update_q(s, dt, phi_gpu) | Final assembly: q += dt·beta(s)·q_rk(:,:,:,:,:,s) |
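The low-storage update keeps only two levels, q_n and q. A serial sketch of one step for the scalar model problem dq/dt = λq is given below. The coefficients are an assumption: they are the Shu–Osher SSP-RK3 scheme written in the ark/brk/crk form above, not necessarily ADAM's RK_3 table. In the FNL kernels, the same update would run per cell inside a parallel loop with the phi_gpu mask.

```fortran
module rk_ls_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
  ! Assumed coefficients: Shu-Osher SSP-RK3 written as
  !   q = ark(s)*q_n + brk(s)*q + dt*crk(s)*L(q)
  real(rp), parameter :: ark(3) = [1.0_rp, 0.75_rp, 1.0_rp / 3.0_rp]
  real(rp), parameter :: brk(3) = [0.0_rp, 0.25_rp, 2.0_rp / 3.0_rp]
  real(rp), parameter :: crk(3) = [1.0_rp, 0.25_rp, 2.0_rp / 3.0_rp]
contains
  subroutine rk3_ls_step(q, dt, lam)
    ! Advances dq/dt = lam*q by one step; only q_n and q are stored.
    real(rp), intent(inout) :: q(:)
    real(rp), intent(in)    :: dt, lam
    real(rp) :: q_n(size(q)), dq(size(q))
    integer :: s
    q_n = q
    do s = 1, 3
      dq = lam * q                                    ! right-hand side at the current stage
      q = ark(s) * q_n + brk(s) * q + dt * crk(s) * dq
    end do
  end subroutine rk3_ls_step
end module rk_ls_sketch
```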

adam_fnl_rk_kernels — RK device kernels

All kernels carry !$acc parallel loop independent and mask solid cells via phi_gpu.

| Kernel | Purpose |
|---|---|
| rk_assign_stage_dev | q_rk(:,s) ← q_gpu (fluid cells only) |
| rk_initialize_stages_dev | q_rk(:,all_s) ← q_gpu |
| rk_compute_stage_dev | q_rk(:,s) += dt·α(s,ss)·q_rk(:,ss) for ss = 1…s−1 |
| rk_compute_stage_ls_dev | q = ark·q_n + brk·q + dt·crk·dq (low-storage) |
| rk_update_q_dev | q += dt·β(s)·q_rk(:,s) for s = 1…nrk |
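The multi-stage path (assign_stage, compute_stage, update_q) can be sketched in the same serial style. This hypothetical example uses the SSP-RK(3,3) scheme in Butcher form: α(2,1) = 1, α(3,1) = α(3,2) = 1/4, β = (1/6, 1/6, 2/3). It assumes that, once the right-hand side of a stage has been evaluated, its slot in q_rk is overwritten with that residual. Under that assumption, the accumulation in rk_compute_stage_dev and the final update in rk_update_q_dev form a standard explicit RK step. The gamma coefficients (stage times) are not needed for an autonomous problem and are omitted.

```fortran
module rk_ssp_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
  integer, parameter :: nrk = 3
contains
  subroutine ssp33_step(q, dt, lam)
    ! Hypothetical serial version of one multi-stage step for dq/dt = lam*q.
    real(rp), intent(inout) :: q(:)
    real(rp), intent(in)    :: dt, lam
    real(rp) :: alph(nrk, nrk), beta(nrk)
    real(rp) :: q_rk(size(q), nrk)
    integer :: s, ss
    ! Assumed SSP-RK(3,3) Butcher coefficients.
    alph = 0.0_rp
    alph(2, 1) = 1.0_rp
    alph(3, 1) = 0.25_rp
    alph(3, 2) = 0.25_rp
    beta = [1.0_rp / 6.0_rp, 1.0_rp / 6.0_rp, 2.0_rp / 3.0_rp]
    do s = 1, nrk
      q_rk(:, s) = q                                  ! assign_stage: start from q^n
      do ss = 1, s - 1                                ! compute_stage: add earlier residuals
        q_rk(:, s) = q_rk(:, s) + dt * alph(s, ss) * q_rk(:, ss)
      end do
      q_rk(:, s) = lam * q_rk(:, s)                   ! slot s now holds the residual L(Y_s)
    end do
    do s = 1, nrk                                     ! update_q: final assembly
      q = q + dt * beta(s) * q_rk(:, s)
    end do
  end subroutine ssp33_step
end module rk_ssp_sketch
```

For a linear problem, any three-stage, third-order scheme returns 1 + z + z²/2 + z³/6 with z = λ·dt, which gives a simple check.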

Immersed boundary method

adam_fnl_ib_object — GPU IB wrapper

Extends ib_object (common). Manages the signed-distance field phi_gpu on device and drives the eikonal enforcement cycle.

Device arrays:

| Array | Shape | Purpose |
|---|---|---|
| phi_gpu | (nb, ni, nj, nk, n_solids+1) | Signed-distance function; last slice holds max over all solids |
| q_bcs_vars_gpu | (:,:) | Boundary condition state variables per solid |

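Sign convention: φ > 0 inside the solid (ghost region) and φ < 0 in the fluid. This is the convention implied by the sphere distance φ = −(‖x − xc‖ − R) and by forming the union of several solids as a pointwise maximum.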

Key methods:

| Method | Purpose |
|---|---|
| initialize(ib, field_gpu) | Allocate phi_gpu and q_bcs_vars_gpu; copy BCS data to device |
| evolve_eikonal(dq_gpu, q_gpu) | Advance eikonal equation inside solid: q -= ∇φ·(q_bc − q) |
| invert_eikonal(q_gpu) | Enforce wall BC at solid surface (φ > 0): reflect momentum |

Wall BC modes applied by invert_eikonal:

  • BCS_VISCOUS (no-slip): (u, v, w) → (−u, −v, −w)
  • BCS_EULER (inviscid): u → u − 2(u·n̂)n̂
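The two reflection rules can be written as a small device-callable function. In this hypothetical sketch, the wall normal is passed in and normalised inside the function. In the IB setting the normal would typically come from ∇φ, but that is an assumption here. The integer flags WALL_NO_SLIP and WALL_SLIP stand in for BCS_VISCOUS and BCS_EULER, whose actual values are not given above.

```fortran
module wall_bc_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
  ! Hypothetical flags standing in for BCS_VISCOUS / BCS_EULER.
  integer, parameter :: WALL_NO_SLIP = 1, WALL_SLIP = 2
contains
  pure function reflect_velocity(u, n, mode) result(ur)
    !$acc routine seq
    real(rp), intent(in) :: u(3)   ! fluid velocity
    real(rp), intent(in) :: n(3)   ! wall normal (any length, non-zero)
    integer,  intent(in) :: mode
    real(rp) :: ur(3), nhat(3)
    select case (mode)
    case (WALL_NO_SLIP)
      ! No-slip: reverse every component.
      ur = -u
    case (WALL_SLIP)
      ! Slip: mirror the normal component, keep the tangential part.
      nhat = n / sqrt(dot_product(n, n))
      ur = u - 2.0_rp * dot_product(u, nhat) * nhat
    case default
      ! Unknown mode: leave the velocity unchanged.
      ur = u
    end select
  end function reflect_velocity
end module wall_bc_sketch
```

Averaging a fluid velocity with its reflected ghost value gives zero velocity at the wall in the no-slip case. In the slip case, it gives zero normal velocity with the tangential part unchanged.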

adam_fnl_ib_kernels — IB device kernels

| Kernel | Purpose |
|---|---|
| compute_phi_analytical_sphere_dev | φ = −(‖x − xc‖ − R): positive inside the sphere |
| compute_phi_all_solids_dev | φ_all = max(φ₁, φ₂, …, φ_ns): union of all solids |
| compute_eikonal_dq_phi_dev | Gradient-weighted residual dq |
| evolve_eikonal_q_phi_dev | q -= dq inside solid (φ > 0) |
| invert_eikonal_q_phi_dev | Momentum reflection at surface (BCS_VISCOUS or BCS_EULER) |
| move_phi_dev | Level-set advection: ∂φ/∂t = −v·∇φ for moving bodies |
| reduce_cell_order_phi_dev | Lower reconstruction order in cells adjacent to solid surface |
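The first two kernels reduce to one-line formulas. Evaluating φ = −(‖x − xc‖ − R) shows that it is positive inside the sphere, zero on its surface, and negative outside. With one φ per solid, the union is therefore the pointwise maximum. The sketch below is a hypothetical scalar version of those two formulas, not the kernels themselves.

```fortran
module sphere_phi_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  pure function phi_sphere(x, xc, r) result(phi)
    !$acc routine seq
    real(rp), intent(in) :: x(3), xc(3), r
    real(rp) :: phi
    ! phi = -(|x - xc| - R): positive inside, zero on the surface, negative outside
    phi = -(sqrt(sum((x - xc)**2)) - r)
  end function phi_sphere

  pure function phi_union(phis) result(phi_all)
    ! Union of solids when phi > 0 marks the inside: pointwise maximum.
    real(rp), intent(in) :: phis(:)
    real(rp) :: phi_all
    phi_all = maxval(phis)
  end function phi_union
end module sphere_phi_sketch
```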

Communication maps

adam_fnl_maps_object — GPU maps wrapper

Extends maps_object (common). Mirrors all communication index tables to device memory so that ghost-cell packing and unpacking happen entirely on the GPU, eliminating CPU staging for MPI buffers.

Device arrays:

| Array | Columns | Content |
|---|---|---|
| local_map_ghost_cell_gpu | 9 | (b_src, b_dst, i_src, j_src, k_src, i_dst, j_dst, k_dst, mode) |
| comm_map_send_ghost_cell_gpu | 7 | (b_src, i, j, k, v_offset, buf_idx, mode) |
| comm_map_recv_ghost_cell_gpu | 6 | (buf_idx, b_dst, i, j, k, v_offset) |
| send_buffer_ghost_gpu | | 1D packed MPI send staging buffer |
| recv_buffer_ghost_gpu | | 1D packed MPI receive staging buffer |
| local_map_bc_crown_gpu | | Boundary condition crown ghost-cell map |

The mode column distinguishes two cases:

  • mode = 1 — one-to-one cell correspondence (same refinement level)
  • mode = 8 — eight-cell average (fine block → coarse block at AMR interface)
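The following is a hypothetical sketch of how a pack kernel might consume such a map, with one map entry per ghost cell and mode selecting a direct copy or a 2×2×2 average. To keep it short, the map is simplified to six rows per entry (b_src, i, j, k, buf_idx, mode) and the v_offset column is dropped. All nv variables of an entry are assumed to be stored contiguously from buf_idx, and the eight fine cells are assumed to start at (i, j, k). None of these layout details are taken from ADAM.

```fortran
module ghost_pack_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  subroutine pack_send_buffer(q, map, buf)
    ! q(nb, ni, nj, nk, nv); map(6, n_entries), hypothetical row layout:
    !   1: b_src, 2: i, 3: j, 4: k, 5: buf_idx (first slot), 6: mode (1 or 8)
    real(rp), intent(in)    :: q(:, :, :, :, :)
    integer,  intent(in)    :: map(:, :)
    real(rp), intent(inout) :: buf(:)
    integer :: n, v, b, i, j, k, ib
    !$acc parallel loop independent private(b, i, j, k, ib, v)
    do n = 1, size(map, 2)
      b = map(1, n); i = map(2, n); j = map(3, n); k = map(4, n); ib = map(5, n)
      do v = 1, size(q, 5)
        if (map(6, n) == 8) then
          ! Fine -> coarse: average the 2x2x2 fine cells starting at (i, j, k).
          buf(ib + v - 1) = 0.125_rp * sum(q(b, i:i+1, j:j+1, k:k+1, v))
        else
          ! Same refinement level: copy one cell.
          buf(ib + v - 1) = q(b, i, j, k, v)
        end if
      end do
    end do
  end subroutine pack_send_buffer
end module ghost_pack_sketch
```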

Key methods:

| Method | Purpose |
|---|---|
| initialize(maps) | Initialise and call copy_cpu_gpu |
| copy_cpu_gpu(verbose) | Transfer all map arrays from CPU to device via dev_assign_to_device |

Finite difference/volume operators

adam_fnl_fdv_operators_library — device-callable spatial operators

Provides the same operators as adam_fdv_operators_library (common) in a form callable from within OpenACC parallel regions. All routines carry !$acc routine seq — no internal parallelism, one thread per cell.

Available operators:

| Operator | FD centred | FV centred |
|---|---|---|
| Gradient ∇q | compute_gradient_fd_centered_dev | compute_gradient_fv_centered_dev |
| Divergence ∇·q | compute_divergence_fd_centered_dev | compute_divergence_fv_centered_dev |
| Curl ∇×q | compute_curl_fd_centered_dev | compute_curl_fv_centered_dev |
| Laplacian ∇²q | compute_laplacian_fd_centered_dev | compute_laplacian_fv_centered_dev |

Each routine accepts a stencil half-width s and the local mesh spacing dxyz, allowing the accuracy order to be selected at call time without recompilation.
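The following is a minimal sketch of how a runtime half-width s can select the accuracy order. It uses standard central-difference coefficients for the first derivative, giving order 2s for s = 1, 2, 3. The function is hypothetical and one-dimensional. A gradient would apply it along each direction with the corresponding entry of dxyz.

```fortran
module fd_centered_sketch
  implicit none
  integer, parameter :: rp = selected_real_kind(15, 307)
contains
  pure function d1_centered(f, dx, s) result(df)
    !$acc routine seq
    ! First derivative at the centre of f(-s:s); order of accuracy 2*s for s = 1..3.
    integer,  intent(in) :: s
    real(rp), intent(in) :: f(-s:s), dx
    real(rp) :: df, c(3)
    integer :: m
    select case (s)
    case (1); c = [0.5_rp, 0.0_rp, 0.0_rp]
    case (2); c = [2.0_rp / 3.0_rp, -1.0_rp / 12.0_rp, 0.0_rp]
    case default; c = [0.75_rp, -0.15_rp, 1.0_rp / 60.0_rp]
    end select
    df = 0.0_rp
    do m = 1, min(s, 3)
      df = df + c(m) * (f(m) - f(-m))   ! antisymmetric central stencil
    end do
    df = df / dx
  end function d1_centered
end module fd_centered_sketch
```

Because s is an ordinary argument, the order is a runtime choice, at the cost of a small branch per call.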


MPI handler

adam_fnl_mpih_object — FUNDAL MPI alias

```fortran
use :: fundal_mpih_object, only : mpih_fnl_object => mpih_object
```

A direct re-export of FUNDAL's MPI handler under the FNL-namespaced type alias mpih_fnl_object. Provides rank/size queries, rank-prefixed console output, and timing utilities. No FNL-specific extensions are needed because FUNDAL's handler already covers GPU-aware MPI requirements.


Module summary

| Module | Role | Extends |
|---|---|---|
| adam_fnl_library | Aggregate entry point | |
| adam_fnl_field_object | GPU field wrapper + host↔device transfer | field_object |
| adam_fnl_field_kernels | Gradient, L2 norm, ghost-cell pack/unpack | |
| adam_fnl_weno_object | GPU WENO coefficient mirror + ROR tables | weno_object |
| adam_fnl_weno_kernels | Upwind WENO reconstruction (!$acc routine seq) | |
| adam_fnl_rk_object | GPU RK stage storage and update dispatch | rk_object |
| adam_fnl_rk_kernels | Stage assign, accumulate, low-storage, final update | |
| adam_fnl_ib_object | GPU distance field + eikonal BC wrapper | ib_object |
| adam_fnl_ib_kernels | Eikonal evolution, sphere distance, momentum inversion | |
| adam_fnl_maps_object | GPU communication index tables + MPI buffer staging | maps_object |
| adam_fnl_mpih_object | FUNDAL MPI handler alias | |
| adam_fnl_fdv_operators_library | Device-callable FD/FV spatial operators | |

Copyrights

ADAM is released under the GNU Lesser General Public License v3.0 (LGPLv3).

Copyright (C) Andrea Di Mascio, Federico Negro, Giacomo Rossi, Francesco Salvadore, Stefano Zaghi.