PRISM — FNL (OpenACC) Backend
PRISM is part of the ADAM framework, released under the GNU Lesser General Public License v3.0 (LGPLv3). Copyright (C) Andrea Di Mascio, Federico Negro, Giacomo Rossi, Francesco Salvadore, Stefano Zaghi.
The FNL backend (prism-fnl) is the GPU-accelerated implementation of PRISM using the FNL library (Fortran aNd LAPACK, OpenACC-based GPU framework). It inherits the full PRISM common infrastructure and extends it with GPU-resident arrays, OpenACC-parallelized FDV operators, and GPU-aware MPI communication via the FUNDAL library.
Source files
| File | Module | Description |
|---|---|---|
| adam_prism_fnl.F90 | entry point | Parses CLI argument, delegates to prism%simulate |
| adam_prism_fnl_object.F90 | adam_prism_fnl_object | Main type, all operators and integrators |
| adam_prism_fnl_library.F90 | adam_prism_fnl_library | Barrel re-export of all FNL PRISM modules |
| adam_prism_fnl_kernels.F90 | adam_prism_fnl_kernels | Standalone OpenACC device kernels |
| adam_prism_fnl_coil_object.F90 | adam_prism_fnl_coil_object | GPU coil current source handler |
| adam_prism_fnl_fWLayer_object.F90 | adam_prism_fnl_fWLayer_object | GPU far-wave layer handler |
| adam_prism_fnl_external_fields_kernels.F90 | adam_prism_fnl_external_fields_kernels | GPU external field sub/add kernels |
Build modes
```bash
FoBiS.py build -mode prism-fnl-nvf-oac       # OpenACC with NVIDIA compiler
FoBiS.py build -mode prism-fnl-nvf-oac-debug # debug build
```

GPU memory layout — critical design point
The FNL backend uses the transposed array layout (nb, ni, nj, nk, nv) for all device arrays, the reverse of the CPU layout (nv, ni, nj, nk, nb). This layout is required for coalesced memory access on NVIDIA GPUs, where threads in a warp execute adjacent values of the innermost loop index:
```fortran
! CPU layout: (nv, ni, nj, nk, nb) — variable index innermost
q_cpu(v, i, j, k, b)

! GPU layout: (nb, ni, nj, nk, nv) — variable index outermost
q_gpu(b, i, j, k, v)
```

With collapse(4) over (b, k, j, i), threads access q_gpu(b,i,j,k,v) with adjacent i values per warp, giving coalesced access in the ni dimension.
The copy_cpu_gpu / copy_gpu_cpu routines perform a full layout transposition on every host↔device transfer via field_gpu%copy_transpose_cpu_gpu / copy_transpose_gpu_cpu.
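The transposition these routines perform can be sketched in NumPy (a stand-in only; sizes are arbitrary and the real routines operate on device memory via FUNDAL):

```python
import numpy as np

# Hypothetical sizes: nv variables, ni x nj x nk cells, nb blocks.
nv, ni, nj, nk, nb = 6, 4, 3, 2, 2

# CPU layout: (nv, ni, nj, nk, nb) — variable index first.
q_cpu = np.arange(nv * ni * nj * nk * nb, dtype=float).reshape(nv, ni, nj, nk, nb)

def copy_transpose_cpu_gpu(q):
    # Swap the first and last axes: (nv, ni, nj, nk, nb) -> (nb, ni, nj, nk, nv).
    return np.transpose(q, (4, 1, 2, 3, 0)).copy()

def copy_transpose_gpu_cpu(q):
    # The permutation is its own inverse, so the same transpose restores
    # the CPU layout.
    return np.transpose(q, (4, 1, 2, 3, 0)).copy()

q_gpu = copy_transpose_cpu_gpu(q_cpu)
assert q_gpu.shape == (nb, ni, nj, nk, nv)
# q_cpu[v, i, j, k, b] and q_gpu[b, i, j, k, v] address the same value.
assert q_gpu[1, 2, 1, 0, 3] == q_cpu[3, 2, 1, 0, 1]
# The round trip is lossless.
assert np.array_equal(copy_transpose_gpu_cpu(q_gpu), q_cpu)
```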
Type: prism_fnl_object
prism_fnl_object extends prism_common_object and adds:
GPU-specific ADAM library objects
| Member | Type | Description |
|---|---|---|
| mpih_gpu | mpih_fnl_object | MPI handler with GPU device initialization |
| field_gpu | field_fnl_object | Field object with GPU memory and transposed halo exchange |
| ib_gpu | ib_fnl_object | Immersed Boundary with device-resident phi_gpu |
| rk_gpu | rk_fnl_object | RK stage arrays on device (q_rk_gpu) |
| weno_gpu | weno_fnl_object | WENO coefficients mirrored on device |
GPU-specific PRISM library objects
| Member | Type | Description |
|---|---|---|
| coil_gpu | prism_fnl_coil_object | Coil J-vector and flag arrays on device (transposed layout) |
| fwlayer_gpu | prism_fnl_fwlayer_object | fWLayer damping function f_gpu on device |
Device data pointers
All allocated via FUNDAL dev_alloc in allocate_gpu:
| Pointer | Shape (GPU transposed) | Description |
|---|---|---|
| q_gpu | (nb, ni, nj, nk, nv) | Field cell-centred variables (points into field_gpu%q_gpu) |
| dq_gpu | (nb, ni, nj, nk, nv) | Residuals right-hand side |
| flxyz_c_gpu | (nb, 3, 3, ni, nj, nk, nv) | Cell-centre fluxes with ±-decomposition |
| flx_f_gpu | (nb, 0:ni, nj, nk, nv) | X-face fluxes |
| fly_f_gpu | (nb, ni, 0:nj, nk, nv) | Y-face fluxes |
| flz_f_gpu | (nb, ni, nj, 0:nk, nv) | Z-face fluxes |
| curl_gpu | (nb, ni, nj, nk, nv) | Curl auxiliary fields |
| divergence_gpu | (nb, ni, nj, nk, nv) | Divergence auxiliary fields and Poisson buffer |
Procedure pointers
Same 9 procedure pointer interfaces as the CPU backend; all point to OpenACC-accelerated implementations operating on GPU arrays.
Initialization sequence (initialize)
mpih_gpu%initialize ← MPI init + CUDA device selection
initialize_common ← 34-step common setup (CPU side)
field_gpu%initialize ← mirror field metadata on device
ib_gpu%initialize ← copy IB phi to device
rk_gpu%initialize ← allocate q_rk_gpu stages
weno_gpu%initialize ← copy WENO coefficients to device
allocate_gpu ← allocate all device data pointers
coil_gpu%initialize ← allocate and transpose coil arrays
fwlayer_gpu%initialize ← allocate f_gpu
external_fields_initialize_dev ← copy external field data to device
set procedure pointers ← based on scheme_space / scheme_time

Dispatch table
Spatial operator pointers
| scheme_space | FDV pointer family | compute_residuals |
|---|---|---|
| FD_CENTERED | *_fd variants | compute_residuals_fd_centered ✓ |
| FV_CENTERED | *_fv variants | (not yet implemented) |
| WENO | *_fv variants | (not yet implemented) |
Time integration pointer
| physical_model | scheme_time | RK variant | integrate | Status |
|---|---|---|---|---|
| EM | BLANES_MOAN | — | integrate_blanesmoan | stub |
| EM | CFM | — | integrate_cfm | stub |
| EM | LEAPFROG | — | integrate_leapfrog | stub |
| EM | RK | RK_1/2/3 | integrate_rk_ls | stub |
| EM | RK | RK_SSP_22/33/54 | integrate_rk_ssp | active |
| EM | RK | RK_YOSHIDA | integrate_rk_yoshida | stub |
| PIC | LEAPFROG | PIC_LEAPFROG/RK | integrate_leapfrog_pic | stub |
The only fully implemented time integrator is SSP RK (integrate_rk_ssp).
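As a reference point for the SSP RK family dispatched above, the classic three-stage, third-order scheme can be sketched on a toy ODE (an illustration only: the rhs function and step count are stand-ins, and PRISM's actual stage coefficients live in rk_gpu):

```python
import math

def rhs(q):
    # Toy residual (stand-in for compute_residuals): dq/dt = -q.
    return -q

def integrate_rk_ssp33(q, dt):
    # Shu-Osher form of the three-stage, third-order SSP RK scheme:
    # each stage is a convex combination of forward-Euler steps, which
    # is what makes the scheme "strong stability preserving".
    q1 = q + dt * rhs(q)
    q2 = 0.75 * q + 0.25 * (q1 + dt * rhs(q1))
    return q / 3.0 + (2.0 / 3.0) * (q2 + dt * rhs(q2))

q, dt = 1.0, 0.01
for _ in range(100):                 # integrate to t = 1
    q = integrate_rk_ssp33(q, dt)
# Third-order accuracy: very close to the exact solution exp(-1).
assert abs(q - math.exp(-1.0)) < 1e-7
```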
OpenACC parallelization pattern
All FDV operator loops follow the same collapsed parallel pattern:
```fortran
!$acc parallel loop independent gang vector collapse(4) DEVICEVAR(q_gpu,out_gpu,dxyz_gpu)
do b=1, blocks_number
   do k=1, nk
      do j=1, nj
         do i=1, ni
            ! scalar kernel call
         enddo
      enddo
   enddo
enddo
```

The DEVICEVAR clause (a FUNDAL macro expanding to OpenACC present or deviceptr depending on the SDK version) asserts that the listed arrays are already resident on the device. Stencil sub-arrays are declared private when needed, so each thread works on its own copy.
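Conceptually, collapse(4) flattens the four loops into a single iteration space with i varying fastest; a small Python sketch of that decode (illustrative only — the actual gang/vector schedule is compiler-defined) shows why warp-adjacent threads receive adjacent i values:

```python
# Decode a flat collapse(4) iteration index back into (b, k, j, i),
# with i innermost (fastest varying), matching the loop nest above.
nb_, nk_, nj_, ni_ = 2, 3, 4, 5   # hypothetical extents

def decode(flat):
    i = flat % ni_
    j = (flat // ni_) % nj_
    k = (flat // (ni_ * nj_)) % nk_
    b = flat // (ni_ * nj_ * nk_)
    return b, k, j, i

triples = [decode(t) for t in range(nb_ * nk_ * nj_ * ni_)]
# The first ni_ consecutive "threads" share (b, k, j) and step through i,
# which is the access pattern the transposed layout is designed for.
assert triples[0] == (0, 0, 0, 0)
assert triples[1] == (0, 0, 0, 1)
assert triples[ni_ - 1] == (0, 0, 0, ni_ - 1)
assert triples[ni_] == (0, 0, 1, 0)
```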
Methods
Data management
allocate_gpu
Allocates all device pointers using dev_alloc from FUNDAL. q_gpu is a pointer alias into field_gpu%q_gpu (already allocated by field_gpu%initialize).
copy_cpu_gpu
Transfers state from CPU to device after initial conditions or restart load:
- field_gpu%copy_transpose_cpu_gpu — transposes and uploads q, curl, divergence, fWLayer%f
- field_gpu%copy_cpu_gpu — uploads grid metadata (dxyz, cell coordinates, BC maps)
- coil_gpu%copy_cpu_gpu — uploads coil amplitude, frequency, phase, J-vector (with transposition)
copy_gpu_cpu
Downloads state from device to CPU before I/O:
- field_gpu%copy_transpose_gpu_cpu — downloads and un-transposes q, curl, divergence, fWLayer%f
- coil_gpu%copy_gpu_cpu — downloads coil arrays
Simulation loop (simulate)
Mirrors the CPU simulate structure with GPU-aware adaptations:
initialize
if restart: load_restart_files → copy_cpu_gpu
else: set_initial_conditions (→ CPU IC + copy_cpu_gpu)
save_simulation_data (→ copy_gpu_cpu before save)
compute_energy (on device)
open residuals file
do
compute_dt ← OpenACC reduction on device
integrate ← dispatched GPU integrator
time += dt
save_simulation_data
compute_energy
if converged: exit
enddo
compute_energy_error
save final data
print initial/final energies and RMS errors
MPI + GPU finalize

Memory status is dumped for both CPU and GPU when save_memory_status = .true..
I/O
load_restart_files
Calls prism_common_object%load_restart_files (CPU read), then copy_cpu_gpu to upload to device.
save_residuals
Computes L2 residual norm on device via compute_normL2_residuals_dev, reduces with MPI_ALLREDUCE(SUM), and writes from rank 0. Called only when is_to_save(residuals_save).
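The reduction pattern can be sketched with plain Python stand-ins for the device reduction and the MPI_ALLREDUCE(SUM) call (data and names are illustrative):

```python
import math

def local_sum_sq(dq_local):
    # Device-side reduction stand-in: sum of squared residuals on one rank.
    return sum(x * x for x in dq_local)

# Two hypothetical MPI ranks, each owning part of the residual field.
rank_data = [[3.0, 4.0], [12.0]]
partial = [local_sum_sq(d) for d in rank_data]

# MPI_ALLREDUCE(SUM) stand-in: every rank ends up with the global sum,
# and the square root is taken only after the global reduction.
global_sum = sum(partial)
norm = math.sqrt(global_sum)   # global L2 norm
assert norm == 13.0            # sqrt(9 + 16 + 144)
```

Taking the square root after the allreduce (rather than summing per-rank norms) is what makes the result identical to a single-rank computation.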
save_simulation_data
Triggered by it_save, restart_save, or slice monitors. Calls update_ghost, then copy_gpu_cpu to bring data to CPU, then compute_auxiliary_fields, then saves XH5F files, restart files, and slices.
Ghost cell update (update_ghost)
Same 3-step async scheme as CPU but on GPU arrays:
| Step | Action |
|---|---|
| 1 | field_gpu%update_ghost_local_gpu — block-local halo fill on device |
| 2 | field_gpu%update_ghost_mpi_gpu — GPU-aware MPI halo exchange |
| 3 | apply_fwl_correction + set_boundary_conditions on device |
update_rk_ghost
Stub for future implementation of radiative BC ghost updates during RK stages.
Boundary conditions (set_boundary_conditions)
Iterates local_map_bc_crown_gpu (device-resident BC map) over ngc ghost crowns using a !$acc parallel loop independent gang vector directive. Currently implemented:
| BC type | Status | Implementation |
|---|---|---|
| EXTRAPOLATION | Active | q_gpu(ghost) = q_gpu(interior) |
| NEUMANN | Active | Mirror reflection formula |
| DIRICHLET | Active | Set to zero |
| Silver-Müller | Pending | "to be implemented" |
| PERIODIC | Pending | "to be implemented" |
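The three active fills can be illustrated with a hypothetical 1D sketch (the device kernel applies the same idea per ghost crown, driven by the BC map):

```python
import numpy as np

def set_boundary_conditions(q, ngc, bc):
    # 1D sketch: q carries ngc ghost cells on each side of the interior.
    n = q.size - 2 * ngc
    for g in range(ngc):                     # one pass per ghost crown
        lo, hi = ngc - 1 - g, ngc + n + g
        if bc == "EXTRAPOLATION":
            # Copy the nearest interior value into the ghost crown.
            q[lo], q[hi] = q[ngc], q[ngc + n - 1]
        elif bc == "NEUMANN":
            # Mirror reflection about the boundary face.
            q[lo], q[hi] = q[ngc + g], q[ngc + n - 1 - g]
        elif bc == "DIRICHLET":
            q[lo] = q[hi] = 0.0
    return q

# ngc = 2 ghost cells (marked 9.0), interior values 1, 2, 3.
q = np.array([9.0, 9.0, 1.0, 2.0, 3.0, 9.0, 9.0])
assert np.array_equal(set_boundary_conditions(q.copy(), 2, "EXTRAPOLATION"),
                      [1.0, 1.0, 1.0, 2.0, 3.0, 3.0, 3.0])
assert np.array_equal(set_boundary_conditions(q.copy(), 2, "NEUMANN"),
                      [2.0, 1.0, 1.0, 2.0, 3.0, 3.0, 2.0])
```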
Sources — compute_coils_current
Delegates to the compute_coils_current_dev kernel in adam_prism_fnl_kernels via coil_gpu's device arrays. The GPU kernel (adam_prism_fnl_kernels.F90) applies the same smooth-start polynomial and branchless step-function arithmetic as the CPU backend, over a collapse(4) gang-vector loop. The optional gamma argument shifts the evaluation time for SSP RK stages.
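The actual smooth-start polynomial is not reproduced here; purely as an illustration, a cubic smoothstep ramp with a clamped (branch-free) evaluation behaves analogously — ramping the coil amplitude from 0 to 1 over a start transient:

```python
import math

def smooth_start(t, t_ramp):
    # Hypothetical smooth-start ramp: cubic smoothstep as a stand-in for
    # PRISM's actual polynomial. The clamp replaces if/else branches,
    # in the spirit of the branchless step-function arithmetic.
    s = min(max(t / t_ramp, 0.0), 1.0)
    return s * s * (3.0 - 2.0 * s)

def coil_current(t, amplitude, frequency, phase, t_ramp):
    # Illustrative driven coil current with the smooth-start envelope.
    return smooth_start(t, t_ramp) * amplitude * math.sin(
        2.0 * math.pi * frequency * t + phase)

assert smooth_start(0.0, 1.0) == 0.0    # starts at zero
assert smooth_start(0.5, 1.0) == 0.5    # midpoint of the ramp
assert smooth_start(1.0, 1.0) == 1.0    # fully ramped
assert smooth_start(2.0, 1.0) == 1.0    # stays at 1 after the transient
assert coil_current(0.0, 1.0, 1.0, 0.0, 1.0) == 0.0
```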
Far-wave layer (apply_fwl_correction)
Computes per-face index ranges from fWLayer%C (layer cell width) and calls apply_fwl_correction_dev for each active face. The device kernel applies the impedance-matching correction to the (alfa_D, beta_D, alfa_B, beta_B) component pairs using the fWLayer function f_gpu; the same form applies to all four components, with a face-dependent sign s2.
Spatial discretization
compute_residuals_fd_centered
The only fully active space operator. Uses a !$acc parallel loop collapse(4) kernel that explicitly extracts 1D stencil arrays (qsx_y, qsx_z, etc.) as private temporaries, then calls the device-side curl kernel compute_curl_fd_centered_dev.
The stencil arrays are declared as local variables with static bounds at compile time (sized to fdv_half_stencils(1)) to be private in the OpenACC region.
Note: The NVF SDK 24.x/25.x does not support firstprivate with associate variables in !$acc regions; the grid scalars (blocks_number, ni, etc.) are passed via copyin instead.
FDV operator library
All operators loop collapse(4) with DEVICEVAR and call the same underlying scalar kernels as the CPU backend. The dxyz_gpu(b,:) array replaces the CPU's dxyz(:,b) due to the transposed layout. Unique to the FNL backend:
- compute_divergence_fd implements the finite-difference stencil inline using embedded coefficient tables FD1_CC(s_max, s_max) supporting up to 10th-order schemes, avoiding a library call that would require complex array slicing across the transposed layout.
- compute_divergence_fv similarly inlines FV reconstruction using compute_reconstruction_r_fv_centered on 1D stencil temporaries declared private.
- Both divergence variants take an extra ovar parameter selecting the output slot in divergence_gpu(:,:,:,:,ovar).
- compute_laplacian_fv is fully implemented (unlike the CPU backend stub).
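The embedded-coefficient idea behind a table like FD1_CC can be illustrated with the standard central first-derivative weights (the values below are textbook central-difference coefficients, not the actual FD1_CC contents):

```python
import math

# Central first-derivative weights at stencil offsets (-s..-1, 1..s),
# indexed by half-width s, in the spirit of an embedded table like FD1_CC.
FD1_CC = {
    1: [-0.5, 0.5],                                       # 2nd order
    2: [1.0 / 12.0, -2.0 / 3.0, 2.0 / 3.0, -1.0 / 12.0],  # 4th order
}

def ddx(f, x, h, s):
    # Apply the half-width-s central stencil to f at x with spacing h.
    w = FD1_CC[s]
    offsets = [o for o in range(-s, s + 1) if o != 0]
    return sum(wi * f(x + o * h) for wi, o in zip(w, offsets)) / h

h = 1e-2
# d/dx sin(x) at x = 1 is cos(1); the wider stencil is far more accurate.
err2 = abs(ddx(math.sin, 1.0, h, 1) - math.cos(1.0))
err4 = abs(ddx(math.sin, 1.0, h, 2) - math.cos(1.0))
assert err2 < 2e-5
assert err4 < 2e-9
assert err4 < err2
```

Keeping the weights in a compile-time table lets the device kernel apply any supported order with the same loop body, which is what makes the inline implementation practical on the transposed layout.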
| Operator | FD | FV | Stencil half-width |
|---|---|---|---|
| compute_curl | compute_curl_fd_centered | compute_curl_fv_centered | fdv_half_stencils(1) |
| compute_derivative1 | compute_derivative1_fd_centered | compute_derivative1_fv_centered | fdv_half_stencils(1) |
| compute_derivative2 | compute_derivative2_fd_centered | compute_derivative2_fv_centered | fdv_half_stencils(1) |
| compute_derivative4 | compute_derivative4_fd_centered | — | fdv_half_stencils(1) |
| compute_divergence | inline FD coefficients | inline FV reconstruction | fdv_half_stencils(1) |
| compute_gradient | compute_gradient_fd_centered | compute_gradient_fv_centered | fdv_half_stencils(1) |
| compute_laplacian | compute_laplacian_fd_centered | compute_laplacian_fv_centered | fdv_half_stencils(2) |
SSP RK time integration (integrate_rk_ssp)
The only fully implemented integrator:
if external_fields: sub_external_fields_dev
rk_gpu%initialize_stages(q_gpu)
for s = 1..nrk:
compute_coils_current(gamma=rk%gamm(s)) ← GPU coil kernel
rk_gpu%compute_stage(s, dt [,phi_gpu]) ← advance q_rk_gpu stage on device
compute_residuals(q_rk_gpu(:,:,:,:,:,s)) ← OpenACC FD curl kernel
if s==1: save_residuals
rk_gpu%assign_stage(s, dq_gpu [,phi_gpu]) ← store stage residuals
rk_gpu%update_q(dt, q_gpu [,phi_gpu]) ← final stage combination
update_rk_ghost(dt) ← (stub)
impose_div_free
if external_fields: add_external_fields_dev

compute_dt
Same CFL formula as CPU (dt = CFL × Δx_min / evmax) but the minimum is computed via an !$acc parallel loop reduction(min:dxyz_min) directly on device, avoiding a device-to-host transfer. Result is then reduced across MPI ranks via MPI_ALLREDUCE(MIN).
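In sketch form, with plain Python stand-ins for the device and MPI reductions (CFL, evmax, and the per-rank data are illustrative values):

```python
# CFL time-step sketch: a per-rank minimum over cell sizes (the device-side
# reduction(min:dxyz_min)) followed by a global MPI_ALLREDUCE(MIN) stand-in.
CFL, evmax = 0.5, 2.0                            # evmax: max eigenvalue / wave speed

rank_dxyz = [[0.10, 0.05, 0.20], [0.08, 0.04]]   # two mock MPI ranks
local_min = [min(d) for d in rank_dxyz]          # on-device min reduction
dxyz_min = min(local_min)                        # MPI_ALLREDUCE(MIN) stand-in
dt = CFL * dxyz_min / evmax                      # dt = CFL * dx_min / evmax
assert abs(dt - 0.01) < 1e-15
```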
compute_energy
Computes the field energy entirely on device using !$acc parallel loop reduction(+:energy) before reducing across MPI ranks. The energy sums 0.5 × |q|² over all interior cells for the D and B fields, exactly as in the CPU backend but without a copy to host.
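A minimal sketch of the interior-only energy sum, using a 2D NumPy stand-in (the real kernel reduces over the D and B components on device before the MPI reduction):

```python
import numpy as np

def compute_energy(q, ngc):
    # Sum 0.5 * |q|^2 over interior cells only, excluding the ngc-wide
    # ghost frame — a stand-in for the on-device reduction(+:energy) loop.
    interior = q[ngc:-ngc, ngc:-ngc]
    return 0.5 * np.sum(interior * interior)

q = np.zeros((4, 4))
q[1:3, 1:3] = 2.0                    # 2x2 interior block, |q| = 2 per cell
assert compute_energy(q, 1) == 8.0   # 0.5 * 4 cells * 2^2
```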
compute_auxiliary_fields
Currently stubbed — divergence and curl computations for diagnostic fields are commented out pending stable compute_divergence_gpu signatures.
Divergence cleaning (impose_div_free / impose_ct_correction)
Same logic as CPU: calls impose_ct_correction(ivar=1) for D and/or ivar=4 for B when Poisson CT correction is enabled. The inner FLAIL multigrid solver (compute_smoothing_multigrid) is currently commented out pending a GPU port; the gradient correction loop is implemented with an !$acc parallel loop collapse(5) kernel.
GPU support modules
prism_fnl_coil_object
Manages device-resident coil data. initialize allocates device arrays via dev_alloc. copy_cpu_gpu manually transposes coil%J_vec(nv,i,j,k,nb) → J_vec_gpu(nb,i,j,k,nv) and coil%coil_flag(i,j,k,nb) → coil_flag_gpu(nb,i,j,k) before uploading.
prism_fnl_fwlayer_object
Stores f_gpu(nb,i,j,k,3) (three directional fWLayer function values) allocated on device. The apply_fwl_correction_dev kernel is direction-agnostic: the caller passes the appropriate index ranges and Barbas-notation field indices for the target face.
adam_prism_fnl_kernels
Standalone module containing public device kernels:
| Subroutine | Description |
|---|---|
| compute_coils_current_dev | Coil current injection with smooth start transient |
| compute_div_d_b_dev | Divergence of D and B fields |
| compute_fluxes_convective_dev | Convective flux computation (WENO, device) |
| compute_fluxes_difference_dev | Flux divergence + source subtraction |
| set_bc_q_gpu_dev | Apply boundary conditions on device |
| set_sir_dev | Initialize direction versors on device |
adam_prism_fnl_external_fields_kernels
Provides external_fields_initialize_dev, sub_external_fields_dev, and add_external_fields_dev for GPU-side external field handling in SSP RK integrations.
Implementation status summary
| Feature | Status |
|---|---|
| FD centered space operator | Active |
| FV centered space operator | Not implemented |
| WENO space operator | Not implemented |
| SSP RK time integration | Active |
| Leapfrog / Yoshida / Blanes-Moan / CFM | Stubbed (CPU logic commented out) |
| PIC integration | Stubbed |
| Silver-Müller BC | Pending |
| Periodic BC | Pending |
| FLAIL multigrid (divergence cleaning) | Partially ported |
| Auxiliary fields (curl/divergence diagnostic) | Pending |