Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they are run on cloud virtual machines while the documentation is generated automatically.
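
If you want to reproduce the timings on your own hardware, a minimal environment containing the packages used below can be set up roughly as follows (a sketch; adjust the package list to the sections you want to run).

import Pkg
Pkg.add(["SummationByPartsOperators", "BenchmarkTools",
         "BandedMatrices", "StaticArrays", "StructArrays"])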

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u
D_SBP * u ≈ D_sparse * u = true

# Benchmark helper: print a label, wait briefly so the printed output is
# flushed, then benchmark mul!(du, D, u) and show the full trial statistics.
function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  26.049 ns … 152.274 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     26.803 ns                GC (median):    0.00%
 Time  (mean ± σ):   27.143 ns ±   2.002 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▇▇█▄▁                                                     
  ▃███████▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▂▂▁▁▂▂▁▁▁▁▂▂▂▂▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  26 ns           Histogram: frequency by time         34.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 570 evaluations per sample.
 Range (min … max):  204.223 ns … 410.098 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.981 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.670 ns ±   7.479 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▂▄▅▇▇█▇▅▄▂                                       
  ▂▁▂▂▂▃▃▄▅▆████████████▇▆▅▅▄▄▄▃▄▄▄▄▄▄▄▄▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
  204 ns           Histogram: frequency by time          240 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
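
One way to see where the difference comes from: the periodic operator applies a single short stencil to every node, while the sparse matrix stores and loads each nonzero entry explicitly. As an illustrative check, using the objects defined above (nnz comes from SparseArrays):

@show nnz(D_sparse)  # roughly two stored entries per row for this stencil
@show length(u)      # number of grid points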

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.88 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u
D_SBP * u ≈ D_sparse * u ≈ D_banded * u = true

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  384.167 ns … 632.507 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.424 ns                GC (median):    0.00%
 Time  (mean ± σ):   391.170 ns ±  13.246 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅██▇▅▄▃▁                                 ▁▂▃▃▂▁            ▂
  ██████████▆▅▅▆▇▇▇▆▅▅▁▅▄▁▁▁▁▃▄▁▁▁▁▁▄▄▁▁▃▁▁▅█████████▇▆▅▅▅▅▄▅ █
  384 ns        Histogram: log(frequency) by time        433 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.533 μs …  9.990 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.630 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.672 μs ± 230.765 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂▆▇█▁▁▁                                            ▁▁▁  ▂
  ▇███████▇▆▅▁▃▁▄▄▅▆▅▆▄▄▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▆▇▇████ █
  4.53 μs      Histogram: log(frequency) by time      5.75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.518 μs … 30.917 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.548 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.684 μs ±  1.019 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅▆▅▅█▄▇▇▃▁▁▃▃▁▃▃▁▄▄▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▄▆ █
  9.52 μs      Histogram: log(frequency) by time       17 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
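
Part of the explanation is that the banded kernel has to process the full bandwidth in every row, and the boundary closures of the SBP operator widen the band well beyond the interior stencil. As an illustrative check, using bandwidths from BandedMatrices.jl:

@show bandwidths(D_banded)  # (lower, upper) bandwidths, widened by the boundary closures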

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.10.2
  [9f78cca6] SummationByPartsOperators v0.5.88 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u
Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u = true

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 201 evaluations per sample.
 Range (min … max):  395.164 ns … 695.229 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     400.846 ns                GC (median):    0.00%
 Time  (mean ± σ):   407.534 ns ±  24.389 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▆▅▂               ▂▃▃▂▁      ▁▁                           ▂
  ██████▇██▅▅▄▃▁▄▄▅▄▅██████▇▇▆▇▆▇████▇▇▅▆▆▅▃▅▆▆▄▆▅▅▄▃▄▆▅▄▄▃▁▄ █
  395 ns        Histogram: log(frequency) by time        502 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 21 evaluations per sample.
 Range (min … max):  989.952 ns …  2.278 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):       1.005 μs               GC (median):    0.00%
 Time  (mean ± σ):     1.015 μs ± 60.633 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█   ▁                                                    ▂
  ███▅▅▇██▅▁▁▃▃▁▃▄▅▅▄▄▅▃▁▁▁▁▁▁▃▃▃▄▁▁▁▄▁▃▄▃▁▁▁▃▃▃▁▃▁▃▁▁▃▃▆███ █
  990 ns        Histogram: log(frequency) by time      1.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare these results to the sparse and banded matrix representations. Unsurprisingly, they are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.011 μs … 11.062 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.059 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.109 μs ± 274.358 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁▁                                               ▁▁  ▂
  ██▄▆██▆▃▃▃▃▁▃▅▅▆▅▅▅▃▃▁▁▁▁▃▇▅▁▁▃▁▁▃▁▁▁▁▁▄▃▃▁▁▁▁▃▁▃▁▁▅████ █
  5.01 μs      Histogram: log(frequency) by time      6.32 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.941 μs … 17.841 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.985 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.057 μs ± 418.166 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█   ▂                                              ▁▁    ▁
  ██▃███▁▁▁▁▁▁▄▅▆▅▅▄▁▃▃▃▅▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█████▆ █
  6.94 μs      Histogram: log(frequency) by time      8.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  119.413 μs … 630.277 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     142.035 μs                GC (median):    0.00%
 Time  (mean ± σ):   144.870 μs ±  11.250 μs   GC (mean ± σ):  0.00% ± 0.00%

                           ▇█▆▄▄▄▃▃▂▂▅▄▃▃▂▂▂▂▁▁               ▂
  ▇▅▄▄▃▄▃▁▄▁▃▃▄▁▃▃▄▁▁▁▁▄▁▃▃████████████████████████▇█▇▇▇▇▇▆▆▆ █
  119 μs        Histogram: log(frequency) by time        169 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
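
A rough operation count explains the gap: with N = 10^3, the dense matrix-vector product performs on the order of N^2 = 10^6 multiply-adds, whereas the stencil-based kernel needs only O(N) operations. As an illustrative check:

@show length(Di_full)  # 10^6 stored entries, almost all of them zero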

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.10.2
  [9f78cca6] SummationByPartsOperators v0.5.88 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structure-of-arrays (SoA) and array-of-structures (AoS) memory layouts, e.g., based on StaticArrays.jl and StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower by a much larger factor.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.842 ns … 73.252 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.672 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.161 ns ±  1.956 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂▇█▇▅                                                   
  ▃███████▆▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▂▃▃▃▂▂▂▂▂ ▃
  46.8 ns         Histogram: frequency by time        55.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (min … max):  323.368 ns … 609.868 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     376.930 ns                GC (median):    0.00%
 Time  (mean ± σ):   379.167 ns ±  23.413 ns   GC (mean ± σ):  0.00% ± 0.00%

               ▁    ▁▁▂▃▆██▆▅▃▃▁▂▂▃▁                          
  ▁▁▃▃▃▂▂▂▂▃▂▄▇██▇▇█████████████████▇▆▆▅▅▅▄▄▄▄▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁ ▄
  323 ns           Histogram: frequency by time          448 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.108 μs …  3.101 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.120 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.131 μs ± 91.705 ns   GC (mean ± σ):  0.00% ± 0.00%

  █   ▁                                                     ▁
  █▅▄██▆▁▁▁▁▁▁▃▁▁▁▅▅▄▇▅▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▆ █
  1.11 μs      Histogram: log(frequency) by time     1.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.272 μs …  3.889 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.280 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.294 μs ± 101.906 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▄▃▆▇▁▁▃▃▁▁▄▁▁▁▁▅▅▃▁▁▃▁▃▄▄▃▁▃▁▃▃▁▁▃▃▃▁▃▁▁▃▃▃▃▁▄▁▁▃▄▁▁▃▅█ █
  1.27 μs      Histogram: log(frequency) by time      1.98 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.563 μs …  5.833 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.667 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.694 μs ± 162.118 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄▇█                                                   
  ▂▃▄▇████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▂▂▂▂▂▂▂ ▃
  2.56 μs         Histogram: frequency by time        3.49 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.280 μs … 12.956 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.316 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.377 μs ± 335.266 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆   ▁                                              ▁▁    ▁
  ██▆▃██▅▁▁▁▃▁▁▁▃▅▅▅▄▄▄▁▃▄▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▃▃▁▃▁▄▇███▇▇ █
  6.28 μs      Histogram: log(frequency) by time      7.86 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
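
One reason the differences shrink here is that every component view is strided: consecutive grid points of a single component are five numbers apart in memory, which plausibly affects all three implementations in a similar way. As an illustrative check, using the arrays defined above:

@show strides(view(u_aos_plain, 1, :))  # stride 5 between consecutive grid points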

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (min … max):  211.822 ns … 286.517 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     215.457 ns                GC (median):    0.00%
 Time  (mean ± σ):   217.158 ns ±   5.077 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▃▆█▃▁                                                
  ▂▂▂▂▂▃▄▇█████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▂▂▂▂▂▂▃▃▄▄▃▃▃▂▂▂▂▂▂ ▃
  212 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 163 evaluations per sample.
 Range (min … max):  644.209 ns … 14.098 μs   GC (min … max): 0.00% … 93.11%
 Time  (median):     666.828 ns                GC (median):    0.00%
 Time  (mean ± σ):   675.356 ns ± 136.900 ns   GC (mean ± σ):  0.19% ±  0.93%

           ▃██▄                                                
  ▁▁▁▁▁▁▂▃▇█████▇▆▆▄▃▃▃▂▂▂▁▁▁▁▁▁▁▂▂▃▃▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  644 ns           Histogram: frequency by time          751 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.906 μs … 41.107 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.995 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.137 μs ±  1.091 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▁▁▄▇▁▁▁▃▁▁▃▃▁▁▄▄▅▅▁▁▁▃▁▃▃▁▃▁▄▁▁▄▁▃▁▁▄▃▄▁▄▄▁▁▁▁▁▁▁▁▁▁▁▇▇ █
  13.9 μs      Histogram: log(frequency) by time      21.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
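
After the reinterpretation, the data behaves like a vector with one Vec5 value per grid node, so the scalar stencil coefficients multiply whole Vec5 values at once. As an illustrative check:

@show eltype(u_aos_r)  # Vec5{Float64}: one five-component value per grid node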

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 516 evaluations per sample.
 Range (min … max):  214.643 ns … 286.581 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     217.926 ns                GC (median):    0.00%
 Time  (mean ± σ):   219.740 ns ±   5.376 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▄▇█▆▄▃▂                              ▁▃▄▃▂▁▁           ▂
  ▄▃▃▄▆████████▇▆▅▅▇█▆▆▅▃▄▄▄▁▁▁▁▁▁▃▁▁▁▁▁▄▁▄▆████████▇▇▇▅▆▇▇▆▆ █
  215 ns        Histogram: log(frequency) by time        237 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  529.243 ns … 826.307 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     556.116 ns                GC (median):    0.00%
 Time  (mean ± σ):   560.635 ns ±  20.590 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▁▃▄▆▆█▇▇▇▆▆▅▂▂▁                                      
  ▁▁▁▂▃▅▇███████████████▇▇▅▄▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▃▂▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁ ▄
  529 ns           Histogram: frequency by time          626 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.856 μs … 40.235 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.937 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.092 μs ±  1.057 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▃▄██▄▁▁▁▁▁▁▁▃▄▅▅▅▄▃▁▁▁▁▁▁▁▁▄▃▁▃▁▃▁▁▁▃▁▃▁▁▁▃▃▁▃▁▃▃▃▁▃▃▆█ █
  13.9 μs      Histogram: log(frequency) by time      21.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 398 evaluations per sample.
 Range (min … max):  242.563 ns … 376.028 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     246.313 ns                GC (median):    0.00%
 Time  (mean ± σ):   248.562 ns ±   7.709 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄▇█▆                                                  
  ▂▃▅███████▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  243 ns           Histogram: frequency by time          272 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.911 μs … 81.622 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.072 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.537 μs ±  2.008 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▂▁                                            ▂▂      ▁
  ██▄▅██▄▄▁▁▁▁▃▆▆▆▅▄▃▃▁▁▃▁▃▁▃▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆████▆▆▅▅ █
  48.9 μs      Histogram: log(frequency) by time      56.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.397 μs … 12.727 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.447 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.510 μs ± 340.813 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇  ▁▁                                              ▁▁▁  ▂
  ███▆▅██▇▃▁▁▁▃▃▁▄▅▅▅▅▅▃▃▁▁▁▃▁▁▃▁▃▁▁▁▁▃▁▃▁▁▃▁▁▁▁▁▁▁▁▃▅█████ █
  5.4 μs       Histogram: log(frequency) by time      6.71 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
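
In the StructArray, each of the five components is stored as its own contiguous vector, so a single stencil application gathers from five separate memory streams; this is a plausible reason for the slowdown observed above. As an illustrative check:

@show typeof(u_soa.x1)  # each component is stored as its own contiguous Vector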

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.88 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`