Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they run on virtual machines in the cloud to generate the documentation automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (minmax):  24.404 ns57.236 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     26.426 ns               GC (median):    0.00%
 Time  (mean ± σ):   26.787 ns ±  1.742 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▁▇█▇▆▅▄▂▂▃▁                                 ▁     ▂
  ▆▆▆▅▆▆▆███████████▅▅▆▅▄▄▃▃▃▃▁▁▁▃▁▁▃▁▃▁▃▄▁▃▃▁▃▁▄▃▄▅██████ █
  24.4 ns      Histogram: log(frequency) by time      35.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 585 evaluations per sample.
 Range (minmax):  200.768 ns363.687 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.615 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.418 ns ±   7.592 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▁▁▄▅█▇▆█▇▆▃▂                                       
  ▁▁▂▂▂▃▄▅▇██████████████▅▅▅▃▃▃▂▂▂▃▂▂▂▃▃▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁ ▃
  201 ns           Histogram: frequency by time          234 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.92 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
 Range (minmax):  430.754 ns923.236 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     435.035 ns                GC (median):    0.00%
 Time  (mean ± σ):   439.544 ns ±  15.563 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇██▆▅▃▁                                   ▁▃▃▃▃▂▁▁        ▂
  ██████████▅▇▆█▇▇▆▆▄▃▁▁▃▃▃▁▅▃▁▃▁▁▃▁▃▃▁▃▁▁▃▁▁▆█████████▇▇▇▆▅▅ █
  431 ns        Histogram: log(frequency) by time        487 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise - it is again much slower, as in periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (minmax):  4.354 μs 13.044 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.428 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.511 μs ± 430.971 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                        ▁                               ▁
  ██▅▄▅▅▅▄▄▃▃▃▁▄▃▃▁▁▃▄▃▃▃▅███▆▄▆▇▆▅▄▅▃▁▃▄▄▃▅▄▁▃▄▅▅▃▄▅▄▆▃▅▅ █
  4.35 μs      Histogram: log(frequency) by time      6.99 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  9.497 μs 31.659 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.518 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.624 μs ± 960.724 ns   GC (mean ± σ):  0.00% ± 0.00%

                                                             
  ▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  9.5 μs          Histogram: frequency by time        17.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.92 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. At first, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

At first, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
 Range (minmax):  419.578 ns595.382 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     423.910 ns                GC (median):    0.00%
 Time  (mean ± σ):   428.052 ns ±  13.445 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▇█                                                    
  ▂▄████▇▅▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▂▂▁▂▂▂▂▂▁▁▁▁▂▁▂▁▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂ ▃
  420 ns           Histogram: frequency by time          473 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 14 evaluations per sample.
 Range (minmax):  983.929 ns 2.301 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     991.857 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.002 μs ± 75.511 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▃▁▃▇▁▁▁▃▁▁▃▅▄▅▁▁▃▃▃▁▁▃▃▁▁▁▁▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅█ █
  984 ns        Histogram: log(frequency) by time      1.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to sparse matrix representations. It will not come as a surprise that these are again much (around an order of magnitude) slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.006 μs 10.141 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.080 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.130 μs ± 292.073 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆█                                                   ▁▁  ▂
  ███▆█▇▅▁▃▁▃▅▄▆▅▄▃▁▃▁▃▁▁▃▃▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▃▇████ █
  5.01 μs      Histogram: log(frequency) by time      6.53 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (minmax):  7.018 μs 20.844 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.056 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.146 μs ± 577.302 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▁                                                ▁▁   ▂
  █▅▇█▁▁▁▁▁▁▄▅▅▆▅▃▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇███▆ █
  7.02 μs      Histogram: log(frequency) by time      9.22 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  140.452 μs208.410 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     141.645 μs                GC (median):    0.00%
 Time  (mean ± σ):   144.203 μs ±   5.349 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄█▄▃▂▁▁▁                 ▃▄▃▂▁▁▁                            ▁
  ▅█████████▇▆▇▆▆▆▅▅▆▅▆▅▆▇▇█████████▇▇▆▇▆▇▆▅▅▅▅▆▅▅▆▆▆▆▅▅▄▄▅▅▄ █
  140 μs        Histogram: log(frequency) by time        164 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.92 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (minmax):  46.609 ns86.552 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.581 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.092 ns ±  2.136 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▄▇█▆                                                   
  ▃▇██████▆▃▄▄▃▃▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▂▁▂▁▁▂▁▁▂▁▁▂▂▁▁▂▂▂▂▃▃▃▂▂▂▂▂▂ ▃
  46.6 ns         Histogram: frequency by time        56.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 229 evaluations per sample.
 Range (minmax):  326.459 ns642.598 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     382.199 ns                GC (median):    0.00%
 Time  (mean ± σ):   384.813 ns ±  21.246 ns   GC (mean ± σ):  0.00% ± 0.00%

                     ▁▄▇▇▇▇██▅▅▅▂▁                            
  ▁▁▂▂▁▁▁▁▁▂▃▄▄▄▄▅▆▇▇█████████████▇▇▆▅▅▄▅▄▄▄▃▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁ ▄
  326 ns           Histogram: frequency by time          453 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.131 μs  3.281 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.149 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.163 μs ± 106.134 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂ ▂▂▂▂▂▂▁▂▁▁▁▂▂▂▂▂▁▁▁▂▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▂▂▂▂▂▁▁▁▂▁▁▂▁▁▂▂ ▂
  1.13 μs         Histogram: frequency by time           2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.274 μs  3.440 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.282 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.296 μs ± 103.218 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▄▁█▁▁▃▁▃▃▁▅▃▅▃▄▁▃▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▄▃▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇ █
  1.27 μs      Histogram: log(frequency) by time      2.09 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (minmax):  2.494 μs  6.317 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.609 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.641 μs ± 193.466 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▅█                                                   
  ▂▃▄███▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▂▂▁▁▂▂▂▁▂▁▁▁▁▂▁▁▁▁▂▂▂▂▂▂▂▂ ▃
  2.49 μs         Histogram: frequency by time         3.6 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.286 μs 14.104 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.336 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.410 μs ± 361.768 ns   GC (mean ± σ):  0.00% ± 0.00%

   █                                                          
  ██▂▂▂▂▂▂▁▁▁▂▂▂▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂ ▂
  6.29 μs         Histogram: frequency by time        8.09 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (minmax):  208.719 ns277.639 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.607 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.574 ns ±   5.622 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▃▅▇█▇▆▄▃▂                                  ▁▃▄▄▃▂▁       ▂
  ▅▆▇████████████████▇▆▅▆▄▁▃▁▃▃▃▁▃▁▁▁▁▃▁▃▁▃▁▁▁▅▇██████████▇▇▆ █
  209 ns        Histogram: log(frequency) by time        230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (minmax):  518.932 ns 10.363 μs   GC (min … max): 0.00% … 93.07%
 Time  (median):     538.126 ns                GC (median):    0.00%
 Time  (mean ± σ):   544.689 ns ± 100.300 ns   GC (mean ± σ):  0.18% ±  0.93%

         ▂▄▇██▅▂                                               
  ▁▁▁▁▂▄▆████████▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  519 ns           Histogram: frequency by time          616 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.986 μs41.287 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     14.076 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.226 μs ±  1.125 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▅▅█▁▁▁▁▁▁▃▅▅▁▄▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▃▄▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▅█ █
  14 μs        Histogram: log(frequency) by time      22.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (minmax):  209.464 ns294.169 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.305 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.247 ns ±   5.661 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▆█                                                
  ▂▂▂▃▆████▇▅▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▂▃▃▄▄▃▃▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 193 evaluations per sample.
 Range (minmax):  505.715 ns 2.012 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     526.891 ns               GC (median):    0.00%
 Time  (mean ± σ):   534.301 ns ± 44.813 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▆▅▆▆██▇▅▃▃▁                                            
  ▂▄▆████████████▆▅▅▄▃▃▃▃▃▃▄▃▄▄▄▄▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂ ▄
  506 ns          Histogram: frequency by time          623 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.796 μs45.836 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.876 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.066 μs ±  1.312 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▄██▁▃▃▁▃▄▄▄▅▃▄▄▄▄▃▁▄▄▄▄▄▄▄▃▄▃▅▅▄▄▃▄▄▃▃▄▄▃▃▁▃▃▃▁▁▄▃▃▁▁▅█ █
  13.8 μs      Histogram: log(frequency) by time      22.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly bad in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (minmax):  223.199 ns323.408 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.077 ns                GC (median):    0.00%
 Time  (mean ± σ):   229.227 ns ±   6.948 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▅▇█▃▁                                                  
  ▂▂▃▄▆██████▆▅▃▃▃▂▂▂▂▂▂▂▂▁▂▁▁▂▁▁▁▁▁▁▁▁▂▁▂▂▂▂▃▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  223 ns           Histogram: frequency by time          249 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  48.881 μs82.655 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.042 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.552 μs ±  2.181 μs   GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▂                                              ▂▃     ▂
  ██▆██▄▃▁▁▆▇▆▆▄▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆███▇▇▅ █
  48.9 μs      Histogram: log(frequency) by time      57.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.348 μs 12.098 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.385 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.450 μs ± 334.400 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                  ▂▁    ▁
  ██▃▇█▅▁▁▃▁▄▅▆▅▄▄▅▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▄▁▃▁▁▁▁▁▁▃▁▄▃▁▁▃▃▄███▇▆▅ █
  5.35 μs      Histogram: log(frequency) by time      6.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.18
  [09ab397b] StructArrays v0.7.3
  [9f78cca6] SummationByPartsOperators v0.5.92 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`