Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they were run on virtual machines in the cloud as part of the automated documentation build.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Benchmark the in-place application `mul!(du, D, u)` of an operator `D` and print the timings
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # brief pause before running the benchmark
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 996 evaluations per sample.
 Range (min … max):  25.520 ns … 51.905 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     26.254 ns               GC (median):    0.00%
 Time  (mean ± σ):   26.538 ns ±  1.416 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂▇██                                                    
  ▃█████▇▆▅▄▃▃▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▁▂▁▂▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  25.5 ns         Histogram: frequency by time        33.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 600 evaluations per sample.
 Range (min … max):  197.403 ns … 371.212 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.073 ns                GC (median):    0.00%
 Time  (mean ± σ):   209.556 ns ±   6.954 ns   GC (mean ± σ):  0.00% ± 0.00%

               ▂▂▅▆▄█▅▇▇▆▄▂                                   
  ▂▂▁▂▂▂▃▃▄▄▆▇▇██████████████▆▇▅▅▅▄▄▄▃▄▄▄▄▄▄▄▅▄▄▃▄▃▃▃▃▃▃▃▂▂▂▂ ▄
  197 ns           Histogram: frequency by time          227 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
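
One plausible reason for this gap is that a periodic finite-difference operator can be applied as a fixed stencil sweeping over the grid, whereas a general sparse matrix-vector product has to follow explicit index arrays. The following is only a rough, hypothetical sketch of that idea for the second-order central difference; it is not the actual implementation in SummationByPartsOperators.jl.

# Rough sketch of a fixed-stencil kernel for the second-order periodic
# central difference; not the implementation used by SummationByPartsOperators.jl.
function central_difference_periodic!(du, u, Δx)
  N = length(u)
  factor = inv(2 * Δx)
  @inbounds for i in 1:N
    ip = i == N ? 1 : i + 1  # right neighbor with periodic wrap-around
    im = i == 1 ? N : i - 1  # left neighbor with periodic wrap-around
    du[i] = factor * (u[ip] - u[im])
  end
  return du
end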

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.89 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.734 ns … 545.355 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.138 ns                GC (median):    0.00%
 Time  (mean ± σ):   389.599 ns ±  11.690 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅▇█▇▆▅▄▂                                    ▁▃▃▃▂▁▁        ▂
  ██████████▇▅▅▆▇▇▇▆▅▅▅▅▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▇█████████▇▅▆▆▆ █
  383 ns        Histogram: log(frequency) by time        430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to a representation of the derivative operator as a sparse matrix. No surprise: it is again much slower, just as for periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.380 μs … 8.565 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.448 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.492 μs ± 229.335 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▅█▇▅▁▁                                             ▁▁ ▁  ▂
  ████████▇▅▅▁▆▇▆▆▆▅▄▁▁▁▃▁▁▁▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇█████ █
  4.38 μs      Histogram: log(frequency) by time      5.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.517 μs … 33.192 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.548 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.676 μs ± 959.758 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁                                                       ▁
  ▅▅█▆█▄▅▆▅▄▃▃▅▄▅▃▁▃▁▄▁▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▅ █
  9.52 μs      Histogram: log(frequency) by time      17.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.89 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  385.547 ns … 499.256 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     388.507 ns                GC (median):    0.00%
 Time  (mean ± σ):   391.802 ns ±  10.771 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅██▇▅▄▄▂                                       ▂▃▃▂▂▁▁     ▂
  ██████████▇▆▆▅▆█▇▇▅▃▃▄▃▁▁▄▁▁▁▁▃▃▄▁▁▁▄▁▁▄▄▁▃▅▃▄▆█████████▇▆▆ █
  386 ns        Histogram: log(frequency) by time        430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.069 μs … 3.907 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.079 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.090 μs ± 90.063 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▂▂▂▂▁▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.07 μs        Histogram: frequency by time        1.79 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.938 μs … 8.968 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.987 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.034 μs ± 232.938 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄▇▇▄  ▁▁                                              ▁▁   ▂
  ████▇██▇▄▃▁▁▄▅▆▆▅▅▁▁▃▄▁▁▁▁▁▁▃▃▅▅▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▆▇█████ █
  4.94 μs      Histogram: log(frequency) by time      6.12 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.907 μs … 17.559 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.951 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.030 μs ± 424.543 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█   ▁▁                                             ▁▁     ▁
  ██▅██▅▃▁▃▃▃▅▆▆▅▅▃▁▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▇▄▁▁▁▁▁▁▁▁▄▆████▇▇▅ █
  6.91 μs      Histogram: log(frequency) by time      8.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  138.328 μs … 308.567 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     140.593 μs                GC (median):    0.00%
 Time  (mean ± σ):   142.806 μs ±   5.530 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▇▇█▇▅▄▃▃▂▂          ▁▃▃▄▄▄▃▂▂▁▁                           ▂
  ███████████████▇▇█▆▆▇▇███████████████▇▆▅▆▇▆▆▆▅▅▄▄▅▅▄▆▅▅▅▇▅▆ █
  138 μs        Histogram: log(frequency) by time        163 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.89 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of StaticVectors from StaticArrays.jl and of StructArrays.jl.
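
To make the terminology concrete, here is a small, purely illustrative sketch; the type Q2 and its fields rho and v are made up for this illustration and are not part of the benchmarks below. An array of structures stores all components of one grid point together, while a structure of arrays stores each component contiguously.

using StaticArrays, StructArrays

# Illustrative two-component "state" per grid point (hypothetical names)
struct Q2{T} <: FieldVector{2, T}
  rho::T
  v::T
end

# Array of structures (AoS): one Q2 per grid point, components interleaved in memory
u_aos_example = [Q2(rand(), rand()) for _ in 1:4]

# Structure of arrays (SoA): one contiguous array per component
u_soa_example = StructArray(u_aos_example)
u_soa_example.rho  # all `rho` values as a plain vector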

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  45.809 ns … 80.444 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.106 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.599 ns ±  1.943 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▆█▇▃▁                                                 
  ▂▂▄▆██████▆▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▂▂▂▂▃▃▃▃▃▂▂▂▂▂ ▃
  45.8 ns         Histogram: frequency by time        55.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 230 evaluations per sample.
 Range (min … max):  322.170 ns … 519.887 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     371.087 ns                GC (median):    0.00%
 Time  (mean ± σ):   372.636 ns ±  19.597 ns   GC (mean ± σ):  0.00% ± 0.00%

                       ▁▃▄▄▆██▅▄▃▃▂▁                          
  ▁▁▂▂▂▃▂▂▂▃▂▂▃▄▅▇▇▇██████████████████▇▆▆▄▄▄▄▄▄▃▄▃▃▃▃▂▂▂▂▂▁▁▂ ▄
  322 ns           Histogram: frequency by time          427 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.115 μs … 3.100 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.126 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.140 μs ± 94.781 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▁                                                    ▁
  █▅▅▃█▆▁▁▁▁▁▁▁▄▃▄▄▄▁▁▃▃▁▁▁█▃▃▁▃▁▃▃▁▄▁▁▁▃▁▁▁▃▁▃▁▃▁▁▁▁▁▁▃▄▆ █
  1.12 μs      Histogram: log(frequency) by time     1.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation, which loops over each component and applies mul! to views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.275 μs … 3.378 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.284 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.297 μs ± 91.706 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▃▃▃▇█▁▃▁▁▁▃▁▄▅▄▄▄▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▅▇ █
  1.28 μs      Histogram: log(frequency) by time        2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.640 μs … 8.091 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.938 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.974 μs ± 221.138 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▅▇█▂                                            
  ▂▂▂▂▃▃▄▆██████▇▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  4.64 μs         Histogram: frequency by time        6.04 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.280 μs … 18.645 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.314 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.390 μs ± 432.111 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇  ▂▁                                              ▁▁    ▁
  ██▅▅██▄▄▃▁▃▃▃▅▆▅▅▃▄▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▃▄▁▃▁▁▁▁▃▄▇▁▄▆▇███▇▆ █
  6.28 μs      Histogram: log(frequency) by time      7.92 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
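
As a small, hypothetical illustration of what reinterpret(reshape, ...) does (the 2×3 matrix below is made up and not part of the benchmarks): it views the columns of a matrix as statically sized vectors without copying any data.

using StaticArrays

A_example = [1.0 3.0 5.0;
             2.0 4.0 6.0]
v_example = reinterpret(reshape, SVector{2, Float64}, A_example)  # 3-element vector view
v_example[2]  # == SVector(3.0, 4.0), i.e., column 2 of A_example; no copy involved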

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (min … max):  210.893 ns … 295.274 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.865 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.698 ns ±   5.441 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▅█▃▁                                                 
  ▂▂▂▂▃▄▆█████▇▅▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▂▂▂▂▂▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  211 ns           Histogram: frequency by time          234 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 185 evaluations per sample.
 Range (min … max):  556.389 ns … 9.897 μs   GC (min … max): 0.00% … 92.62%
 Time  (median):     578.595 ns                GC (median):    0.00%
 Time  (mean ± σ):   589.011 ns ± 104.715 ns   GC (mean ± σ):  0.16% ±  0.93%

     ▁█                                                      
  ▂▂▄███▅▃▂▂▂▃▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▂▂▁▁▁▂▁▁▁▁▂ ▃
  556 ns           Histogram: frequency by time          808 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.906 μs … 30.217 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.976 μs                GC (median):    0.00%
 Time  (mean ± σ):   14.105 μs ± 940.558 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁▂▂▂▂▂▂▁▁▁▂▁▂▂▂▂▂▂▁▁▂▁▁▁▁▂▁▁▂▂▂▂▂▁▁▂▁▁▁▁▂▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂ ▂
  13.9 μs         Histogram: frequency by time         21.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (min … max):  207.442 ns … 270.835 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.649 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.152 ns ±   5.126 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃█▇▃▄▅▇▄▁                                              
  ▂▂▂▂▄▇█████████▅▄▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂▂▂▂▂▃▄▃▄▃▃▄▄▃▃▃▂▂▂▂▂▂▂ ▃
  207 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations per sample.
 Range (min … max):  549.743 ns … 953.706 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     569.401 ns                GC (median):    0.00%
 Time  (mean ± σ):   573.920 ns ±  18.475 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▁▂▃▄▅▇▇█▅▄▁                                         
  ▁▁▂▃▄▅▇████████████▇▆▅▅▄▃▃▂▂▂▁▂▁▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▁▂▂▁▁▁ ▃
  550 ns           Histogram: frequency by time          625 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.786 μs … 36.348 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.875 μs                GC (median):    0.00%
 Time  (mean ± σ):   14.002 μs ± 971.493 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▁▂▂▂▁▁▁▁▂▁▁▂▂▂▂▂▂▂▁▂▁▁▁▂▂▂▂▁▁▁▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  13.8 μs         Histogram: frequency by time         21.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (min … max):  224.226 ns … 318.042 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.872 ns                GC (median):    0.00%
 Time  (mean ± σ):   229.756 ns ±   6.091 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▄███▅▃▂                                                 
  ▂▂▄▆████████▇▄▄▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▂▁▁▂▂▂▃▃▃▃▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  224 ns           Histogram: frequency by time          248 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.841 μs … 198.612 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.021 μs                GC (median):    0.00%
 Time  (mean ± σ):   49.513 μs ±   2.586 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▆ ▂▁                                           ▁▁▁       ▁
  ██▇▄██▃▃▁▃▁▅▆▆▆▅▄▃▁▁▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆█████▇▆▆▆▅ █
  48.8 μs       Histogram: log(frequency) by time      57.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.482 μs … 10.239 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.545 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.601 μs ± 281.684 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇█  ▁▁                                           ▁▁      ▂
  ███▇██▇▄▃▃▁▄▃▅▄▅▄▁▄▄▃▁▁▁▁▁▃▁▁▄▁▃▁▃▃▁▃▁▁▃▁▃▄▁▁▃▄▅█████▇▆▆ █
  5.48 μs      Histogram: log(frequency) by time       6.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.16
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.89 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`