Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run on virtual machines in the cloud as part of the automated documentation build.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  27.186 ns … 50.979 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     27.860 ns               GC (median):    0.00%
 Time  (mean ± σ):   28.212 ns ±  1.757 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█                                                     
  ▅███▆▄▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▂▁▁▁▂▁▁▂▁▁▁▁▂▂▁▂▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  27.2 ns         Histogram: frequency by time        37.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 600 evaluations per sample.
 Range (min … max):  198.302 ns … 396.888 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.305 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.130 ns ±   8.840 ns   GC (mean ± σ):  0.00% ± 0.00%

              ▂▄▆▆█▆▇█▅▃▂                                     
  ▁▁▁▁▁▂▂▄▅▆▇▇████████████▆▆▄▄▃▃▃▂▂▂▂▃▃▃▃▃▄▃▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁ ▃
  198 ns           Histogram: frequency by time          232 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.685 ns … 603.685 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.281 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.251 ns ±  13.838 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▇▆▄▃▁                                       ▂▂▃▃▂▁      ▂
  ██████████▇▇█▇▇▅▅▄▄▁▃▁▁▁▃▃▁▃▁▁▁▁▁▁▁▃▁▃▁▁▃▄▄▄▅▃▅▇████████▆▆▆ █
  383 ns        Histogram: log(frequency) by time        435 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.488 μs … 8.783 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.584 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.639 μs ± 285.829 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█ ▁                                             ▁▁▁   ▂
  ▇██████▆▄▃▅▅▆▆▁▄▅▄▃▁▃▁▁▃▃▁▁▁▄▁▁▁▁▁▁▁▃▁▄▃▃▃▁▁▄▁▁▄▇▇█████▆ █
  4.49 μs      Histogram: log(frequency) by time      5.92 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.507 μs … 76.666 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.537 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.693 μs ±  1.282 μs   GC (mean ± σ):  0.00% ± 0.00%

                 ▁                                          ▁
  ▄▁▄▅▄▁▃▃▁▁▄▄▁▄█▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅ █
  9.51 μs      Histogram: log(frequency) by time       18 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (min … max):  386.411 ns … 779.322 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     389.436 ns                GC (median):    0.00%
 Time  (mean ± σ):   393.478 ns ±  14.537 ns   GC (mean ± σ):  0.00% ± 0.00%

    ██                                                         
  ▃███▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂ ▃
  386 ns           Histogram: frequency by time          439 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 14 evaluations per sample.
 Range (min … max):  986.786 ns … 20.332 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     996.857 ns                GC (median):    0.00%
 Time  (mean ± σ):     1.022 μs ± 327.317 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                            ▁
  █▄▄█▄▁▄▁▁▆▇▇▅▅▃▄▅▃▃▃▅▆▆▅▆▅▃▃▃▁▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆█▇ █
  987 ns        Histogram: log(frequency) by time       1.62 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower (by around an order of magnitude).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.859 μs … 10.894 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.902 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.960 μs ± 299.584 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                  ▁▁▁   ▁
  ██▅█▇▄▁▃▁▆▆▆▃▄▃▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇████▆ █
  4.86 μs      Histogram: log(frequency) by time      6.23 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.883 μs … 16.070 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.909 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.004 μs ± 528.775 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                ▁▁        ▁
  █▅▁▇▁▁▁▁▃▆▆▅▄▁▃▁▁▄▁▁▁▁▁▁▃▁▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▁▃▁▁▃▇████▇▆▆▅▅▃ █
  6.88 μs      Histogram: log(frequency) by time      8.91 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea, but it is instructive to see just how slow it gets.
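A rough back-of-the-envelope sketch of why: the SBP, sparse, and banded representations only touch O(N) entries per application, while a dense matrix-vector product touches all N^2 entries. The stencil half-width b = 4 below is an illustrative guess, not a value read off Di_SBP.

```julia
N = 10^3   # grid size used in this section
b = 4      # illustrative stencil half-width (an assumption, not read off Di_SBP)

entries_banded = N * (2b + 1)  # entries touched by a banded-style product: O(N)
entries_dense  = N^2           # entries touched by a dense product: O(N^2)

println(entries_dense / entries_banded)  # roughly two orders of magnitude
```

This crude estimate is in the same ballpark as the measured slowdown of the dense representation below.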

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  115.094 μs … 291.032 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     116.486 μs                GC (median):    0.00%
 Time  (mean ± σ):   119.118 μs ±   7.947 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▆█▅▃▂▂▂              ▁▃▃▃▂▁▁                               ▁
  ██████████▆▇▆▆▆▆▆▇▆▇▆▇██████████▇▆▇▆▅▅▆▅▅▄▄▃▅▄▄▅▆▅▄▅▄▅▂▅▄▂▄ █
  115 μs        Histogram: log(frequency) by time        143 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structure-of-arrays (SoA) and array-of-structures (AoS) memory layouts, e.g., as provided by StructArrays.jl and StaticArrays.jl.
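
For orientation, here is a minimal sketch of the two layouts, using StaticArrays.jl and StructArrays.jl (both loaded in the benchmark code below); the Vec2 type is made up for this illustration:

```julia
using StaticArrays, StructArrays

# A small two-component type stored as a static vector.
struct Vec2{T} <: FieldVector{2, T}
  x::T
  y::T
end

# Array of structures (AoS): fields are interleaved in memory,
# i.e., x1, y1, x2, y2, ...
aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]

# Structure of arrays (SoA): one plain array per field,
# i.e., x1, x2, ..., y1, y2, ...
soa = StructArray(aos)

@assert soa.x == [1.0, 3.0]  # field arrays are directly accessible
@assert soa == aos           # same logical content, different memory layout
```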

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.214 ns … 91.708 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.470 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.096 ns ±  2.592 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▅▇██▇▇▅▄▃▂▁                               ▂▂▁▂▁         ▃
  ███████████████▇▇▅▆▇▆█▆▄▃▅▃▁▃▄▁▁▃▁▁▁▃▁▁▁▄▅█████████▇▇▇▆▆▆ █
  46.2 ns      Histogram: log(frequency) by time      58.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 230 evaluations per sample.
 Range (min … max):  321.687 ns … 678.613 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     376.265 ns                GC (median):    0.00%
 Time  (mean ± σ):   379.058 ns ±  23.571 ns   GC (mean ± σ):  0.00% ± 0.00%

                     ▁▂▄▅▇██▄▃▃▁▁▁                           
  ▁▁▂▂▂▂▂▂▂▂▂▃▄▆▇▇▇▇▇█████████████▆▆▅▅▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁ ▄
  322 ns           Histogram: frequency by time          448 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.118 μs … 4.114 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.131 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.149 μs ± 127.122 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▅▄▅▅▁▁▁▁▁▁▃▅▅▃▁▃▁▁▁▁▁▁▁▃▁▃▆█▁▁▁▁▁▁▃▁▁▁▃▁▃▁▁▁▃▃▁▁▁▁▁▁▃▄▇▇ █
  1.12 μs      Histogram: log(frequency) by time      2.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.271 μs … 2.998 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.280 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.293 μs ± 101.293 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▁▁▁▁▂▁▁▁▁▁▁▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  1.27 μs         Histogram: frequency by time        2.13 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.504 μs … 5.823 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.624 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.658 μs ± 191.913 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃█                                                    
  ▂▂▄▇██▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▁▂▂▁▁▂▁▂▁▂▁▁▁▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.5 μs          Histogram: frequency by time        3.64 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.352 μs … 18.214 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.400 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.501 μs ± 534.005 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                       ▁▁▁              ▁
  ██▆▅▄▁▁▄▆▆▇▅▃▃▃▁▃▃▃▁▁▁▁▃▁▁▃▁▃▃▁▁▁▃▁▁▄▁▁▅▇███▇▆▅▅▅▅▅▅▁▄▅▄▄ █
  6.35 μs      Histogram: log(frequency) by time      8.83 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 535 evaluations per sample.
 Range (min … max):  210.129 ns … 324.006 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.456 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.493 ns ±   6.127 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▅▇██▇▅▄▃▁                            ▁▃▃▄▄▃▂▁        ▃
  ▃▄▄▆██████████████▇▇▃▄▄▃▅▅▁▄▄▃▃▅▃▁▁▁▃▁▃▃▁▁▆▇██████████▇▆▇▇▇ █
  210 ns        Histogram: log(frequency) by time        235 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 182 evaluations per sample.
 Range (min … max):  568.747 ns … 11.722 μs   GC (min … max): 0.00% … 92.85%
 Time  (median):     610.643 ns                GC (median):    0.00%
 Time  (mean ± σ):   617.414 ns ± 114.237 ns   GC (mean ± σ):  0.18% ±  0.93%

                   ▂▄▅▆▇█▅▄▃                                  
  ▂▁▂▂▁▂▂▂▂▃▃▃▄▄▆▇███████████▆▅▄▄▃▃▂▂▂▂▂▂▃▂▃▃▄▄▄▄▃▄▃▃▃▃▃▂▂▂▂▂ ▄
  569 ns           Histogram: frequency by time          680 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.976 μs … 42.840 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     14.056 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.221 μs ±  1.186 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▁▄▆▆▁▃▁▁▃▄▅▅▇█▄▁▃▁▃▁▄▃▁▄▁▄▃▃▁▃▃▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▄▆▇ █
  14 μs        Histogram: log(frequency) by time      22.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. In this case, there is no significant difference from the previous version.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (min … max):  210.520 ns … 295.326 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     213.974 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.042 ns ±   5.990 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▂▄▆██▆▅▃▁  ▁                             ▂▃▄▄▃▁▁         ▂
  ▅▅▆██████████████▇▆▆▃▁▃▄▁▃▄▁▁▁▁▃▄▃▃▃▃▁▃▁▁▅▆█████████▇▇▇▇▇▆▇ █
  211 ns        Histogram: log(frequency) by time        234 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
 Range (min … max):  507.141 ns … 836.870 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     524.052 ns                GC (median):    0.00%
 Time  (mean ± σ):   529.391 ns ±  18.967 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▄▆██▇▇▅▄▂                                            
  ▂▂▂▃▃▅███████████▇▆▅▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▃▃▄▄▄▃▃▃▄▃▃▃▃▂▂▂▂▂ ▄
  507 ns           Histogram: frequency by time          582 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.825 μs … 198.609 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.895 μs                GC (median):    0.00%
 Time  (mean ± σ):   15.616 μs ±   9.554 μs   GC (mean ± σ):  0.00% ± 0.00%

        ▃ ▁                                                  ▁
  █▆▇▇▆█▇█▅▅▄▄▁▁▄▁▁▁▁▄▁▁▄▁▁▃▁▃▁▃▁▄▁▁▁▁▃▃▁▄▃▃▃▄▄▄▅▄▅▆▆▆▆▅▅▆▆ █
  13.8 μs       Histogram: log(frequency) by time      65.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (min … max):  224.077 ns … 358.996 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.412 ns                GC (median):    0.00%
 Time  (mean ± σ):   229.624 ns ±   7.413 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃▆██▄▁                                                   
  ▂▂▄▇██████▇▅▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▂▁▁▁▁▁▂▁▂▂▂▂▂▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂ ▃
  224 ns           Histogram: frequency by time          250 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.911 μs … 99.095 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.071 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.595 μs ±  2.313 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                ▁▂▂     ▂
  ██▅██▁▁▁▁▇▇▅▅▃▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇█████▇▅ █
  48.9 μs      Histogram: log(frequency) by time      58.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.479 μs … 15.278 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.517 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.627 μs ± 530.798 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                        ▂▁                                ▁
  █▃▇▆▆▅▄▁▁▃▁▁▁▁▃▁▄▃▁▄▃▄▇██▇▅▅▆▅▆▄▅▄▄▅▄▄▅▅▃▃▁▄▄▃▄▄▃▃▄▃▅▄▄▆ █
  5.48 μs      Histogram: log(frequency) by time      8.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.18
  [09ab397b] StructArrays v0.7.3
  [9f78cca6] SummationByPartsOperators v0.5.91 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`