Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they run on virtual machines in the cloud to generate the documentation automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (minmax):  26.301 ns54.383 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     27.096 ns               GC (median):    0.00%
 Time  (mean ± σ):   27.434 ns ±  1.637 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃██▆                                                     
  ▃██████▅▅▄▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▂▁▁▁▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  26.3 ns         Histogram: frequency by time        35.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (minmax):  203.690 ns355.427 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.046 ns                GC (median):    0.00%
 Time  (mean ± σ):   215.964 ns ±   7.875 ns   GC (mean ± σ):  0.00% ± 0.00%

              ▄▃█▇▇▇▄▃▄▂                                      
  ▁▁▁▁▁▂▂▃▄▅▆████████████▆▆▅▄▃▂▃▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁ ▃
  204 ns           Histogram: frequency by time          237 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.95 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (minmax):  384.409 ns583.897 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.128 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.708 ns ±  12.070 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆██▆▅▃                                            ▁▃▃▃▂▁   ▂
  ███████▇▄▄▃▄▆█▇▇▅▄▅▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▁▁▁▁▁▃▄▁▃▁▁▁▁▁▁▃▅███████▆ █
  384 ns        Histogram: log(frequency) by time        431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise - it is again much slower, as in periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (minmax):  4.364 μs  9.747 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.433 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.476 μs ± 247.416 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█                                                  ▁▂▁ ▂
  ████▃▃▁▃▁▁▁▅▅▆▄▅▅▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▃▁▁▁▁▁▁▅▇███ █
  4.36 μs      Histogram: log(frequency) by time      5.63 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  9.497 μs 34.034 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.518 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.620 μs ± 883.412 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄▁▄▃▄▇▁▁▁▁▁▁▁▁▃▄▄▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃ █
  9.5 μs       Histogram: log(frequency) by time        16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.95 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. At first, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

At first, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (minmax):  384.069 ns670.759 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.882 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.674 ns ±  13.474 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▇▅▃▁                                       ▃▃▃▂▁          ▂
  ███████▅▄▄▆▇▇▇▆▅▄▄▃▄▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄███████▆▅▆▄▄▆▅▆ █
  384 ns        Histogram: log(frequency) by time        437 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 22 evaluations per sample.
 Range (minmax):  983.636 ns 1.972 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     990.955 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.000 μs ± 58.118 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                        ▁ ▁
  █▁▃▆▅▁▁▁▁▁▁▅▅▅▃▃▁▄▁▁▁▁▃▁▁▁▁▃▃▁▁▄▁▁▃▄▁▃▁▄▃▁▁▄▁▁▃▁▁▁▃▁▁▁▅██ █
  984 ns        Histogram: log(frequency) by time      1.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to sparse matrix representations. It will not come as a surprise that these are again much (around an order of magnitude) slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (minmax):  4.929 μs  8.675 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.028 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.075 μs ± 253.788 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▆                                                     
  ▂▃▇██▅▂▂▁▂▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▂▂▂▂▂ ▂
  4.93 μs         Histogram: frequency by time        6.24 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.865 μs 15.507 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.889 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.958 μs ± 405.864 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                      ▁▁  ▁
  █▁▁▁▁▁▁▁▁▃▅▆▅▅▃▃▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▆███ █
  6.86 μs      Histogram: log(frequency) by time      8.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  118.381 μs212.277 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     142.206 μs                GC (median):    0.00%
 Time  (mean ± σ):   142.953 μs ±  11.489 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▅▅▃▁▁        ▁▁          ▅█▄▃▂▂▃▃▃▃▄▆▅▃▃▂▁  ▁▁▁▁▂▂          ▂
  █████▇▇▇▅▅▆▆▇████▇▆▅▆▄▃▁▅███████████████████████████▇▇▇▇▇▄▆ █
  118 μs        Histogram: log(frequency) by time        172 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.95 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (minmax):  45.414 ns91.506 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.433 ns               GC (median):    0.00%
 Time  (mean ± σ):   49.017 ns ±  2.207 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▃▆▇██▇▆▆▅▄▂▂ ▁▂▁                      ▁▂▂▂▂▁▁    ▃
  ▄▅▅▅▅▆▅▃▅████████████████▇▇▇▅▁▁▆▇▄▇▅▃▁▃▁▁▅▁▁▃▅▆█████████▇ █
  45.4 ns      Histogram: log(frequency) by time      57.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (minmax):  323.193 ns796.618 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     372.667 ns                GC (median):    0.00%
 Time  (mean ± σ):   375.098 ns ±  21.435 ns   GC (mean ± σ):  0.00% ± 0.00%

                    ▂▁▃▃▄▆▆█▇▅▂▂ ▁▁                           
  ▁▂▂▃▂▂▂▂▂▂▂▂▄▄▇▆▇▇███████████████▇▅▆▄▃▄▄▃▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▂▁▁ ▄
  323 ns           Histogram: frequency by time          437 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.111 μs  4.053 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.123 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.135 μs ± 105.744 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▂▂▂▂▂▁▁▁▁▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▂▂▁▂▁▁▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.11 μs         Histogram: frequency by time        1.95 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.276 μs  4.538 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.284 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.298 μs ± 104.124 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁▂▂▁▁▁▁▁▁▁▂▂▂▂▁▂▁▁▁▁▁▁▁▁▂▂▂▂▁▁▁▂▂▂▁▂▂▁▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▁▂▂ ▂
  1.28 μs         Histogram: frequency by time        2.09 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (minmax):  2.774 μs  5.668 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.870 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.899 μs ± 170.303 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▇                                                     
  ▂▃▆██▅▃▂▂▂▂▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂ ▃
  2.77 μs         Histogram: frequency by time        3.83 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.226 μs 12.405 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.270 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.338 μs ± 365.628 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅▄▂                                                    ▁▁  ▁
  ███▅▆▃▁▃▄▁▃▁▄▆▅▅▄▄▁▁▁▁▁▁▁▃▃▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▄▁▁▄▅▇███ █
  6.23 μs      Histogram: log(frequency) by time      8.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (minmax):  208.907 ns266.757 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.469 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.421 ns ±   5.504 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂█                                                 
  ▂▂▂▂▄███▆▄▃▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 183 evaluations per sample.
 Range (minmax):  560.995 ns 10.309 μs   GC (min … max): 0.00% … 90.98%
 Time  (median):     593.896 ns                GC (median):    0.00%
 Time  (mean ± σ):   600.487 ns ± 100.829 ns   GC (mean ± σ):  0.16% ±  0.91%

                  ▂▄██▅▃                                       
  ▁▁▁▁▁▁▁▂▂▂▂▃▄▅▆████████▄▃▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁▁ ▂
  561 ns           Histogram: frequency by time          662 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.845 μs61.615 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.916 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.070 μs ±  1.161 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▄▇▇▁▁▁▁▃▃▃▄▄▄▄▄▃▁▁▄▄▃▁▃▄▄▄▃▃▄▁▃▃▃▄▁▁▁▃▄▁▁▁▁▁▁▁▁▁▁▁▃▁▁▅█ █
  13.8 μs      Histogram: log(frequency) by time      22.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (minmax):  210.152 ns344.661 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     213.233 ns                GC (median):    0.00%
 Time  (mean ± σ):   215.236 ns ±   5.860 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▄▇█▇▆▄▂                                 ▁▃▄▄▄▂▁         ▂
  ▃▃▄▇████████▇▅▇██▇▆▆▅▄▃▅▃▁▁▃▃▄▁▃▁▁▄▃▁▁▁▁▁▁▁▅████████▇▆▅▆▇▇▆ █
  210 ns        Histogram: log(frequency) by time        233 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 188 evaluations per sample.
 Range (minmax):  544.048 ns899.122 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     561.585 ns                GC (median):    0.00%
 Time  (mean ± σ):   567.136 ns ±  20.060 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▁▃▅███▄▂                                              
  ▁▁▁▂▃▅▆█████████▄▄▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▃
  544 ns           Histogram: frequency by time          628 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.796 μs38.222 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.875 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.024 μs ±  1.122 μs   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁▂▂▂▁▂▂▁▁▁▁▂▂▂▁▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▂▂▂▂▂▁▂▂▁▂▁▁▂▁▁▁▂▂▂▁▁▁▁▁▂▂ ▂
  13.8 μs         Histogram: frequency by time        22.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly bad in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (minmax):  224.226 ns335.941 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.958 ns                GC (median):    0.00%
 Time  (mean ± σ):   230.116 ns ±   6.647 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▂▆█▇                                                
  ▂▂▃▄▇█████▇▆▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▂▂▂▂▂▃▃▄▄▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  224 ns           Histogram: frequency by time          250 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  48.560 μs99.275 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.731 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.265 μs ±  2.540 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                ▂▂     ▁
  ██▅▁▅▃▁▁▁▁▄▇▆▅▅▅▃▄▁▃▃▁▁▁▁▁▁▁▁▃▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▆████▇▆▆ █
  48.6 μs      Histogram: log(frequency) by time      57.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.370 μs 12.073 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.420 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.474 μs ± 293.661 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃█                                                    ▁▂▁ ▂
  ██▁▄▃▁▁▁▁▁▁▃▄▆▅▅▅▄▁▁▁▁▁▃▁▃▃▁▃▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▃▁▆███ █
  5.37 μs      Histogram: log(frequency) by time      6.84 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.11
Commit a2b11907d7b (2026-03-09 14:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.18
  [09ab397b] StructArrays v0.7.3
  [9f78cca6] SummationByPartsOperators v0.5.95 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`