Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they are run on cloud virtual machines during the automated documentation build.
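
If you want to reproduce any of the benchmarks locally, a minimal setup could look as follows (a sketch; these are the registered names of the packages used below):

import Pkg
Pkg.add(["SummationByPartsOperators", "BenchmarkTools", "BandedMatrices",
         "StaticArrays", "StructArrays"])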

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  24.689 ns … 53.225 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     27.045 ns               GC (median):    0.00%
 Time  (mean ± σ):   27.323 ns ±  1.458 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▆██▁                                           
  ▂▂▂▂▂▂▂▂▂▂▆████▆▄▃▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▁▁▁▁▂▁▂▁▂▂▂▂▂▂▂▂ ▃
  24.7 ns         Histogram: frequency by time        34.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
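
The operator achieves this speed because it only applies a short repeated stencil with periodic wrap-around, so no matrix entries have to be loaded from memory. The following is a minimal sketch of such a kernel for the second-order central stencil, illustrative only and not the actual implementation used by SummationByPartsOperators.jl:

# Minimal sketch of a fixed-stencil kernel for a second-order accurate
# periodic first-derivative operator (not the actual implementation of
# SummationByPartsOperators.jl).
function periodic_central_derivative!(du, u, Δx)
  N = length(u)
  factor = inv(2 * Δx)
  @inbounds for i in 1:N
    im1 = i == 1 ? N : i - 1  # periodic wrap-around on the left
    ip1 = i == N ? 1 : i + 1  # periodic wrap-around on the right
    du[i] = factor * (u[ip1] - u[im1])
  end
  return du
end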

Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be roughly an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 611 evaluations per sample.
 Range (min … max):  198.015 ns … 370.661 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.862 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.335 ns ±   6.761 ns   GC (mean ± σ):  0.00% ± 0.00%

               ▂▂▅▆▅██▆▆▃▂                                    
  ▁▁▁▁▁▁▂▃▃▄▅▅██████████████▇▆▅▄▄▄▃▃▃▃▃▃▃▃▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁ ▄
  198 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  384.118 ns … 574.916 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     388.064 ns                GC (median):    0.00%
 Time  (mean ± σ):   391.519 ns ±  12.034 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇▇█▇▆▅▃▁                                    ▁▂▂▃▃▂▂▁▁     ▂
  ▇██████████▆▅▅▇▇▇▇▅▅▁▅▁▃▃▁▁▃▃▃▃▄▁▁▁▁▁▃▁▁▃▁▃▅▆▇█████████▆▇▆▆ █
  384 ns        Histogram: log(frequency) by time        431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
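
For non-periodic SBP operators, only a few rows near each boundary carry special closure coefficients, while the interior applies one repeated stencil. A simplified schematic of this structure (again not the actual implementation) could look like:

# Schematic: apply a few dense boundary closure rows plus one repeated
# interior stencil. Illustrative sketch only; SummationByPartsOperators.jl
# uses optimized internals instead.
function sbp_apply!(du, closure_left, stencil, closure_right, u)
  N = length(u)
  nb = size(closure_left, 1)           # number of boundary closure rows
  w = length(stencil) ÷ 2              # half-width of the interior stencil
  for i in 1:nb                        # left boundary rows
    du[i] = sum(closure_left[i, j] * u[j] for j in axes(closure_left, 2))
  end
  @inbounds for i in (nb + 1):(N - nb) # repeated interior stencil
    du[i] = sum(stencil[w + 1 + k] * u[i + k] for k in -w:w)
  end
  for i in 1:nb                        # right boundary rows
    offset = N - size(closure_right, 2)
    du[N - nb + i] = sum(closure_right[i, j] * u[offset + j]
                         for j in axes(closure_right, 2))
  end
  return du
end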

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.385 μs … 8.456 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.511 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.560 μs ± 266.889 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▆▇█▁                                            ▁▁    ▂
  ▄▇█████▆▅▃▄▁▄▅▅▅▅▅▄▄▃▁▁▁▃▁▃▃▃▁▁▁▃▁▁▃▄▄▅▅▁▃▁▁▁▁▃▅▆██████▇ █
  4.39 μs      Histogram: log(frequency) by time      5.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl, presumably because a generic banded kernel has to load a coefficient for every entry of every row and cannot exploit the fact that the interior stencil is identical in each row.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.517 μs … 38.202 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.548 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.698 μs ±  1.093 μs   GC (mean ± σ):  0.00% ± 0.00%

     ▁                                                      ▁
  ▆▆█▄▄▁▁▁▃▁▃▁▁▁▄▄▄▃▁▁▁▁▁▁▃▁▇▇▁▃▁▁▃▁▁▃▁▃▁▁▁▁▁▁▁▁▆▁▃▁▃▁▆▄▃▄ █
  9.52 μs      Histogram: log(frequency) by time     16.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  383.281 ns … 777.123 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.030 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.530 ns ±  13.299 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▅█                                                    
  ▂▃▇███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▂▁▂▁▂▂▁▂▂▂▂▂▃▃▃▃▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          428 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.066 μs … 3.279 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.073 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.085 μs ± 90.913 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▂▁▂▂▂▂▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.07 μs        Histogram: frequency by time        1.79 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
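
As a quick sanity check, one can verify the defining property of these dissipation operators: multiplied by the SBP mass matrix, they are symmetric and negative semidefinite. A usage sketch (assuming the mass_matrix function exported by SummationByPartsOperators.jl):

# Sanity check (sketch): M * Di should be symmetric and negative
# semidefinite, where M = mass_matrix(D_SBP) is the diagonal SBP mass
# (norm) matrix.
M = mass_matrix(D_SBP)
MDi = M * Matrix(Di_SBP)
@show MDi ≈ MDi'                                   # symmetry
@show maximum(eigvals(Symmetric(MDi))) <= sqrt(eps(T))  # eigenvalues ≤ 0 up to roundoff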

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.123 μs … 15.890 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.562 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.638 μs ± 392.288 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▅▇█▅▂                                              
  ▂▂▂▂▃▅██████▇▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  7.12 μs         Histogram: frequency by time        9.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.023 μs … 15.642 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.061 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.134 μs ± 424.915 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                       ▁  ▁
  █▁▃▅▃▃▁▁▁▁▃▃▆▅▅▅▃▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▆▆▁▁▁▃▁▁▁▃▇████ █
  7.02 μs      Histogram: log(frequency) by time         9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea: for N = 10^3, a dense matrix-vector product performs on the order of N^2 = 10^6 operations, whereas the stencil-based and banded versions only need O(N).

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  115.125 μs … 292.608 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     116.558 μs                GC (median):    0.00%
 Time  (mean ± σ):   118.643 μs ±   5.721 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▄▂▁▁              ▂▄▃▂▁                                 ▁
  ▇████████▇▇▆▇▅▅▅▅▅▄▄▅▇███████████▇▇▆▅▄▅▅▅▅▅▅▅▄▄▅▅▅▄▅▆▆▄▅▅▅▄ █
  115 μs        Histogram: log(frequency) by time        139 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, e.g., vectors of StaticVectors from StaticArrays.jl (array of structures) and StructArrays from StructArrays.jl (structure of arrays).

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176
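
Before benchmarking, here is a minimal illustration of the two memory layouts, using the Vec5 type defined above:

# AoS: one Vec5 per grid point, i.e., the five components are interleaved.
u_aos_demo = [Vec5(1.0, 2.0, 3.0, 4.0, 5.0) for _ in 1:3]
# SoA: StructArrays stores one contiguous array per field instead.
u_soa_demo = StructArray(u_aos_demo)
u_soa_demo.x1  # the x1 components of all elements, stored contiguously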

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  47.268 ns … 98.678 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.169 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.691 ns ±  2.087 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▅▇███▇▆▅▅▄▄▃▂▁▁▁                               ▂▂▁▁▁▁    ▃
  ███████████████████▆▇▅▅▅▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▄▇██████████▇ █
  47.3 ns      Histogram: log(frequency) by time      56.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (min … max):  324.553 ns … 576.342 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     378.250 ns                GC (median):    0.00%
 Time  (mean ± σ):   380.647 ns ±  20.020 ns   GC (mean ± σ):  0.00% ± 0.00%

                       ▁▂▄▆█▇▆▄▃▃▁                           
  ▁▁▁▂▂▁▂▂▂▂▂▂▂▄▄▄▄▅▆▇█████████████▆▆▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁ ▄
  325 ns           Histogram: frequency by time          440 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.110 μs … 3.320 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.121 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.138 μs ± 110.674 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▅▁▄▅▃▁▁▃▁▁▁▁▄▄▅▄▃▇█▃▃▃▄▁▁▁▁▁▁▃▁▃▃▄▁▄▃▅▄▃▃▄▄▄▄▄▃▄▅▅▄▄▅▅▄▇ █
  1.11 μs      Histogram: log(frequency) by time      1.88 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.271 μs … 3.486 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.282 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.296 μs ± 100.629 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▃▁▅█▃▁▁▃▁▁▁▃▁▅▅▄▄▃▁▁▃▃▄▄▃▃▁▄▃▁▁▅▃▃▄▃▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▇ █
  1.27 μs      Histogram: log(frequency) by time      2.01 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.509 μs … 6.079 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.622 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.647 μs ± 154.305 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃██▁                                                  
  ▂▂▄▆████▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂ ▃
  2.51 μs         Histogram: frequency by time        3.48 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.258 μs … 14.878 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.300 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.395 μs ± 455.482 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                 ▁▁▁                     ▂
  ██▆▃▃▁▄▅▇▆▃▁▁▃▄▃▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▃▇████▆▆▅▅▅▄▃▄▃▅▃▁▃▁▅▅▆▆▇ █
  6.26 μs      Histogram: log(frequency) by time      8.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. Here, it is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 516 evaluations per sample.
 Range (min … max):  210.899 ns … 286.855 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.764 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.502 ns ±   5.207 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▃▇                                             
  ▂▂▁▂▂▂▃▃▆████▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▂▂▂▁▂▁▁▂▁▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂ ▃
  211 ns           Histogram: frequency by time          232 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 183 evaluations per sample.
 Range (min … max):  567.617 ns … 11.081 μs   GC (min … max): 0.00% … 93.15%
 Time  (median):     592.475 ns                GC (median):    0.00%
 Time  (mean ± σ):   611.928 ns ± 146.018 ns   GC (mean ± σ):  0.17% ±  0.93%

   ▄▇▆▃ ▁▄▄▂                                                   ▂
  ▇████████▇▇▆▇▇▇▅▅▃▅▃▄▅▁▁▄▁▃▄▅▄▃▅▄▅▃▁▃▅▄▁▅▅▅▅▆▅▅▅▃▅▇▇▇▆▅▅▄▅▆ █
  568 ns        Histogram: log(frequency) by time          1 μs <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.936 μs … 35.015 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     14.036 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.185 μs ±  1.028 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▄▄█▇▄▄▃▄▁▃▃▄▄▇▇▇▃▄▃▁▁▄▃▁▃▃▃▁▁▁▁▃▁▁▁▁▁▃▃▁▁▃▄▁▁▁▁▁▁▁▁▁▅▆▇ █
  13.9 μs      Histogram: log(frequency) by time      21.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 535 evaluations per sample.
 Range (min … max):  213.615 ns … 279.738 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     216.013 ns                GC (median):    0.00%
 Time  (mean ± σ):   217.851 ns ±   5.168 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▆▇█▇▆▅▄▂                                ▁▂▃▄▄▃▂▁        ▂
  ▆▇███████████▆▇███▇▆▅▅▅▄▁▁▄▁▁▁▄▄▆▃▃▁▄▅▄▄▃▄▇▆█████████▇▇▇▅▆▇ █
  214 ns        Histogram: log(frequency) by time        233 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 188 evaluations per sample.
 Range (min … max):  534.723 ns … 1.006 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     575.335 ns               GC (median):    0.00%
 Time  (mean ± σ):   580.300 ns ± 24.045 ns   GC (mean ± σ):  0.00% ± 0.00%

                  ▄▇█                                       
  ▂▂▂▂▂▂▂▂▂▃▃▃▃▅▆█████▅▄▃▃▂▂▂▂▂▂▂▃▃▄▄▄▃▃▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  535 ns          Histogram: frequency by time          662 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.895 μs … 39.134 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.956 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.108 μs ±  1.071 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▄▄▁▇▇▆▃▁▃▃▁▃▄▄▅▄▁▃▄▃▄▄▃▃▁▃▃▄▄▃▃▁▃▁▁▁▁▄▁▄▁▃▁▁▁▁▁▁▁▁▁▁▁▄▇▇ █
  13.9 μs      Histogram: log(frequency) by time      21.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. In particular, the sparse matrix representation performs especially badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 473 evaluations per sample.
 Range (min … max):  225.772 ns … 358.516 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     229.797 ns                GC (median):    0.00%
 Time  (mean ± σ):   231.938 ns ±   7.272 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▆█▅▂                                                    
  ▂▂▃▆█████▇▃▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▂▂▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  226 ns           Histogram: frequency by time          259 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.791 μs … 89.057 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.981 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.452 μs ±  2.187 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                               ▁▁▁     ▁
  ██▅▃██▃▁▁▁▁▄▆▇▆▅▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▅██████▆▆ █
  48.8 μs      Histogram: log(frequency) by time        57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.370 μs … 11.015 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.407 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.459 μs ± 287.278 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                     ▁   ▁
  ██▃▅▆▃▁▁▁▁▁▃▅▆▅▅▃▃▁▁▁▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅▇████▇ █
  5.37 μs      Histogram: log(frequency) by time      6.75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.16
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`