Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they run on virtual machines in the cloud to generate the documentation automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (minmax):  26.964 ns91.576 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     27.831 ns               GC (median):    0.00%
 Time  (mean ± σ):   28.708 ns ±  3.561 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇▇█▇▆▄▁                      ▁▁▁                          ▂
  ███████▇█▇▄▄▅▄▃▄▁▁▁▃▃▁▁▁▃▁▁▄▇█████▇▆▆▅▇▆▇▆▇▇▇▇▇▇██▇█▇██▇▇ █
  27 ns        Histogram: log(frequency) by time      43.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 600 evaluations per sample.
 Range (minmax):  199.203 ns422.785 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.142 ns                GC (median):    0.00%
 Time  (mean ± σ):   215.281 ns ±  21.320 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▇                                                        
  ▂▃██▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▁▁▁▂▁▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  199 ns           Histogram: frequency by time          365 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.96-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 200 evaluations per sample.
 Range (minmax):  406.255 ns699.550 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     410.510 ns                GC (median):    0.00%
 Time  (mean ± σ):   414.479 ns ±  13.767 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃█▇                                                     
  ▂▅████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▁▁▂▂▁▁▁▁▁▁▁▁▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  406 ns           Histogram: frequency by time          462 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise - it is again much slower, as in periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (minmax):  4.481 μs 20.671 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.583 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.633 μs ± 307.328 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▆█▇▆                                               ▁▁▁  ▂
  ▇███████▇▅▃▃▄▃▅▅▆▅▄▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▃▁▃▃▁▃▁▃▃▃▃▆█████ █
  4.48 μs      Histogram: log(frequency) by time      5.82 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  9.587 μs 25.417 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.617 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.728 μs ± 970.557 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▄▄█▅▁▁▁▁▁▁▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆ █
  9.59 μs      Histogram: log(frequency) by time      17.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.96-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. At first, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

At first, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 204 evaluations per sample.
 Range (minmax):  383.407 ns778.407 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.500 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.988 ns ±  16.236 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂█                                                          
  ▃██▅▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▂▂▁▂▂▁▂▂▂▂▁▂▂▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
  383 ns           Histogram: frequency by time          451 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 21 evaluations per sample.
 Range (minmax):  980.905 ns 2.310 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     991.333 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.001 μs ± 63.727 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                        ▁ ▁
  ██▄▃██▁▁▁▁▁▁▁▅▅▄▄▃▃▁▁▁▃▁▃▁▁▁▁▃▁▁▄▄▃▁▁▃▁▁▁▄▃▁▃▁▃▁▁▁▁▁▁▃▁▄▆█ █
  981 ns        Histogram: log(frequency) by time      1.39 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to sparse matrix representations. It will not come as a surprise that these are again much (around an order of magnitude) slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.045 μs 10.262 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.116 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.168 μs ± 288.668 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆█                                                   ▁▁  ▂
  ███▅▇▆▄▃▁▁▁▄▅▆▅▅▅▃▃▁▁▃▁▁▁▁▁▁▁▃▄▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▅████ █
  5.04 μs      Histogram: log(frequency) by time      6.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.863 μs 15.741 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.887 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.956 μs ± 381.516 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                      ▁▁  ▁
  █▁▇█▁▁▁▁▃▁▃▄▅▅▆▄▄▁▁▄▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▇███ █
  6.86 μs      Histogram: log(frequency) by time      8.61 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  117.288 μs409.570 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     118.810 μs                GC (median):    0.00%
 Time  (mean ± σ):   122.809 μs ±  14.759 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▇▅▃▂    ▁▃▄▃▂▁▁                                             ▂
  █████▇██████████▇█▆▆▆▆▇▇▆▆▆▆▅▆▅▅▁▄▅▃▅▅▅▄▅▅▆▄▃▅▄▃▃▃▄▁▅▄▄▅▃▃▅ █
  117 μs        Histogram: log(frequency) by time        185 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.96-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (minmax):  46.850 ns75.732 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.622 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.188 ns ±  2.186 ns   GC (mean ± σ):  0.00% ± 0.00%

   █▅▃                                                    
  ▄█████▆▄▃▃▃▃▃▂▃▂▂▂▂▂▁▂▂▁▁▁▂▂▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▂▂▂▂▂▂▂▂▂▂ ▃
  46.9 ns         Histogram: frequency by time        57.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (minmax):  323.057 ns577.781 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     373.018 ns                GC (median):    0.00%
 Time  (mean ± σ):   375.378 ns ±  19.999 ns   GC (mean ± σ):  0.00% ± 0.00%

                      ▂▃▂▄▆▆▇█▅▃▄▁                            
  ▁▁▁▂▂▂▂▂▂▂▂▂▃▃▅▆▇▇█████████████████▆▆▅▄▄▄▃▄▃▄▄▃▄▃▃▂▂▂▂▂▂▁▂▂ ▄
  323 ns           Histogram: frequency by time          432 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.097 μs 11.653 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.108 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.143 μs ± 359.227 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▇▇▁▃▁▃█▆▄▄▄▄▅▄▃▁▅▅▄▁▇▁▃▁▁▁▃▃▅▃▁▁▄▃▃▄▄▁▃▁▁▃▁▁▁▁▁▁▃▁▄▅▇▇ █
  1.1 μs       Histogram: log(frequency) by time      1.99 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.270 μs  7.338 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.279 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.299 μs ± 182.900 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▁▄▇▁▄▁▁▄▁▁▄▃▅▃▄▃▄▁▁▃▃▄▃▄▄▁▃▄▁▆▆▄▄▄▄▁▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▅▄█ █
  1.27 μs      Histogram: log(frequency) by time       2.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (minmax):  2.446 μs  6.283 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.549 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.580 μs ± 185.102 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁▅▇██▅▄▂▁                                           ▁    ▂
  ▅█████████▇▃▅▄▁▁▄▄▅▄▆▅▅▁▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▁▃▃▇███▇█ █
  2.45 μs      Histogram: log(frequency) by time      3.55 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.248 μs 16.222 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.288 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.358 μs ± 380.032 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇▅▂                                                    ▁▁  ▁
  ███▆█▆▃▁▁▃▁▁▁▅▆▅▅▄▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▃▁▃▁▁▁▃▁▄████ █
  6.25 μs      Histogram: log(frequency) by time      8.08 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 511 evaluations per sample.
 Range (minmax):  216.370 ns413.370 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     217.861 ns                GC (median):    0.00%
 Time  (mean ± σ):   220.103 ns ±   6.954 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█▆▄▄▃▁                                   ▃▄▃▃▂▁           ▂
  ▇████████▅▄▆███▇▆▄▄▁▃▁▁▁▁▃▃▁▁▃▁▁▁▃▁▁▁▁▄▁▁▃████████▇▇▅▆▆▆▆▆▇ █
  216 ns        Histogram: log(frequency) by time        239 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations per sample.
 Range (minmax):  583.569 ns 10.724 μs   GC (min … max): 0.00% … 91.53%
 Time  (median):     612.351 ns                GC (median):    0.00%
 Time  (mean ± σ):   618.870 ns ± 103.544 ns   GC (mean ± σ):  0.16% ±  0.92%

              ▂▄▆▇█▅▄                                          
  ▁▁▁▁▁▁▂▂▃▄▅▇████████▅▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  584 ns           Histogram: frequency by time          687 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.845 μs41.086 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.916 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.097 μs ±  1.214 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▄▇▇▁▁▁▁▁▁▁▄▄▄▄▄▄▅▃▁▄▄▃▃▃▃▄▄▃▄▄▅▁▄▄▃▁▄▁▃▃▄▁▃▄▁▄▁▁▁▁▁▃▁▅█ █
  13.8 μs      Histogram: log(frequency) by time      22.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (minmax):  207.062 ns267.376 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.861 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.794 ns ±   5.606 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▅█                                             
  ▂▁▂▂▂▂▂▃▄████▅▄▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▂▁▁▁▁▁▂▂▂▂▂▂▃▄▄▃▂▂▂▂▂▂▂▂▂▂ ▃
  207 ns           Histogram: frequency by time          232 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 190 evaluations per sample.
 Range (minmax):  522.021 ns 1.013 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     548.174 ns               GC (median):    0.00%
 Time  (mean ± σ):   553.774 ns ± 26.682 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▄▆▇▆▃                                             
  ▁▁▂▂▃▄▄▆███████▅▄▃▃▂▂▂▁▁▁▁▁▂▂▂▃▂▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  522 ns          Histogram: frequency by time          641 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.885 μs36.557 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.955 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.112 μs ±  1.156 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▃▇▇▁▁▁▃▁▁▃▄▄▄▁▃▃▁▃▁▄▃▃▁▄▃▃▃▄▄▃▅▄▃▃▃▃▁▁▁▁▃▁▁▁▃▁▃▁▁▃▁▁▁▄█ █
  13.9 μs      Histogram: log(frequency) by time      22.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly bad in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 473 evaluations per sample.
 Range (minmax):  226.042 ns325.359 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     229.347 ns                GC (median):    0.00%
 Time  (mean ± σ):   231.678 ns ±   7.504 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▅█▇▇▂▁                                                    
  ▂▅▇██████▆▄▄▃▃▂▂▂▂▁▂▂▂▂▂▁▂▁▁▁▂▁▁▁▂▂▃▃▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  226 ns           Histogram: frequency by time          257 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  49.221 μs89.737 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.402 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.925 μs ±  2.328 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▂                                              ▂▂▁    ▂
  ███▁█▆▃▁▁▁▆▇▇▄▄▃▁▁▃▁▃▁▁▃▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆███▇█▇ █
  49.2 μs      Histogram: log(frequency) by time      58.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.664 μs 13.273 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.722 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.784 μs ± 321.746 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇                                                   ▂▂  ▂
  ████▇█▆▁▁▁▁▁▁▅▆▆▅▁▃▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▇███ █
  5.66 μs      Histogram: log(frequency) by time      7.18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.18
  [09ab397b] StructArrays v0.7.3
  [9f78cca6] SummationByPartsOperators v0.5.96-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`