Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they were run on virtual machines in the cloud while generating this documentation automatically.
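
If the absolute numbers matter to you, it is best to rerun the benchmarks on your own machine. A minimal setup could look as follows, installing the packages used throughout this page:

using Pkg
Pkg.add(["BenchmarkTools", "SummationByPartsOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])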

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Benchmark `mul!(du, D, u)` for the operator `D` and pretty-print the results.
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # let previous output flush before benchmarking
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  24.750 ns … 66.940 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     27.045 ns               GC (median):    0.00%
 Time  (mean ± σ):   27.329 ns ±  1.689 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▄█▆▆                                            
  ▂▂▂▂▂▂▂▂▃████▇▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  24.7 ns         Histogram: frequency by time        35.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 590 evaluations per sample.
 Range (min … max):  197.505 ns … 365.788 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.858 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.394 ns ±   6.972 ns   GC (mean ± σ):  0.00% ± 0.00%

                 ▄▄▆▇▆██▇▆▄▂                                  
  ▁▁▁▁▁▂▂▂▂▃▄▄▆██████████████▇▆▅▄▄▂▃▃▂▂▂▃▃▃▃▃▄▃▃▃▃▃▃▂▂▂▂▂▂▂▁▁ ▄
  198 ns           Histogram: frequency by time          228 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
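
To quantify this on the machine at hand, the speedup can be computed directly from measured runtimes. Here is a minimal sketch using BenchmarkTools' @belapsed, which returns the minimum runtime in seconds; the ratio follows from the timings above and will vary with your hardware:

t_sbp    = @belapsed mul!($du, $D_SBP, $u)
t_sparse = @belapsed mul!($du, $D_sparse, $u)
@show t_sparse / t_sbp # ≈ 7-8 with the median timings shown above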

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.90 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  383.773 ns … 618.695 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.783 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.425 ns ±  12.474 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆██▆▅▃▁                                      ▂▃▃▃▂▁        ▂
  ████████▇▇▃▄▆▇█▇▆▅▅▄▃▃▃▁▃▃▁▃▃▃▃▄▁▃▁▁▄▄▃▃▃▁▅▄▇█████████▆▆▆▇▆ █
  384 ns        Histogram: log(frequency) by time        432 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.382 μs …  9.601 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.464 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.527 μs ± 299.271 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▆▂▁                                    ▁▁▁             ▂
  ▇██████▇█▇▆█▇▇▆▆▆▅▃▄▄▄▄▅▄▁▃▁▄▃▃▄▁▃▃▃▁▁▁▃▆▇████▆▆▆▅▅▄▆▇▅▃▄ █
  4.38 μs      Histogram: log(frequency) by time      5.88 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.527 μs … 38.852 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.608 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.776 μs ±  1.080 μs   GC (mean ± σ):  0.00% ± 0.00%

  █       ▁                                                 ▁
  █▄▁▇█▆██▆▅▅▄▅▄▅▅▁▅▄▁▃▁▁▃▁▁▁▁▄▁▃▃▁▁▁▃▁▃▃▁▁▁▁▁▁▁█▃▁▁▃▁▁▁▃▄ █
  9.53 μs      Histogram: log(frequency) by time     16.9 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
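
The bandwidths of the banded representation can be inspected directly, and the relative timings can be computed as in the periodic case. A short sketch; bandwidths is provided by BandedMatrices.jl, and the ratios follow from the median timings above:

@show bandwidths(D_banded) # (lower, upper) bandwidth of the operator matrix
t_sbp    = @belapsed mul!($du, $D_SBP, $u)
t_sparse = @belapsed mul!($du, $D_sparse, $u)
t_banded = @belapsed mul!($du, $D_banded, $u)
@show t_sparse / t_sbp # ≈ 11 with the medians shown above
@show t_banded / t_sbp # ≈ 25 with the medians shown above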

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.90 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  383.478 ns … 623.581 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.537 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.244 ns ±  12.608 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆██▆▅▄▃▁                                     ▁▂▃▃▂▁▁       ▂
  ██████████▆▆▇▇▇▅▆▅▄▄▁▁▁▁▁▃▁▄▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▄▅▇████████▇▇▇▅▆ █
  383 ns        Histogram: log(frequency) by time        431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 14 evaluations per sample.
 Range (min … max):  981.786 ns … 2.760 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     992.571 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.003 μs ± 72.410 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                          ▁
  █▅▅▄▅█▆▅▅▆▆▅▇▆▅▅▄▁▄▄▁▃▅▄▃▁▁▄▁▁▁▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▄▅▆▇ █
  982 ns        Histogram: log(frequency) by time      1.53 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.095 μs … 11.199 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.163 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.220 μs ± 281.040 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█                                                 ▁▁▁   ▂
  ██████▆▅▇▇▇▆▇▇▆▇▅▆▅▄▁▃▄▄▄▄▁▁▁▁▃▃▁▁▄▁▃▁▃▁▁▁▁▃▁▁▁▁▃▆▇████▇ █
  5.09 μs      Histogram: log(frequency) by time      6.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.066 μs … 16.892 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.128 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.211 μs ± 435.727 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁                                                 ▁   ▁
  ██▇█▇▇▇▆▇▇▆▆▇▇▇▅▄▅▅▄▅▄▄▄▃▁▄▄▃▁▁▁▃▁▄▁▃▁▁▃▁▁▁▁▁▁▁▁▁▃▆█████ █
  7.07 μs      Histogram: log(frequency) by time       9.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  115.166 μs … 312.146 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     116.328 μs                GC (median):    0.00%
 Time  (mean ± σ):   118.356 μs ±   5.802 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▂▁▁▁             ▃▄▁                                     ▁
  ▇███████▇▇▆▇▅▄▅▄▅▅▆██████▇▆▆▇▆▆▆▅▅▅▅▅▆▆▆▅▄▄▄▅▅▄▃▅▄▄▅▃▄▃▄▄▃▄ █
  115 μs        Histogram: log(frequency) by time        142 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
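
The size of the gap has a simple explanation: a dense matrix-vector product performs N^2 multiply-adds, whereas the SBP, sparse, and banded variants only touch the O(N) nonzero entries. A back-of-the-envelope check, purely illustrative:

N = size(Di_full, 1) # 1000
N^2                  # 10^6 multiply-adds for the dense product
# versus roughly stencil_width * N ≈ 10^4 operations for the other variants,
# consistent with the observed factor of ~100 between Di_full and Di_SBP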

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.90 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of StaticVectors from StaticArrays.jl and of StructArrays.jl, i.e., of both array-of-structures (AoS) and structure-of-arrays (SoA) data layouts.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end
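# As a FieldVector, Vec5 behaves like a 5-component SVector from
# StaticArrays.jl: addition and scalar multiplication act component-wise,
# which is what allows `mul!` to treat each grid value as a single element.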

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  47.771 ns … 80.637 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.644 ns               GC (median):    0.00%
 Time  (mean ± σ):   49.125 ns ±  2.039 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▇▇█▇▄▁                                                   
  ▃▇██████▅▄▃▃▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▁▂▁▁▁▁▁▁▂▁▁▂▂▁▂▂▁▂▂▂▂▂▂▃▃▃▂▂▂▂ ▃
  47.8 ns         Histogram: frequency by time          57 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 229 evaluations per sample.
 Range (min … max):  322.179 ns … 715.926 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     375.114 ns                GC (median):    0.00%
 Time  (mean ± σ):   377.644 ns ±  25.730 ns   GC (mean ± σ):  0.00% ± 0.00%

                   ▁▂▃▂▄▇█▇▄▃▃▂▁▁▁                           
  ▁▂▃▃▃▃▂▃▂▃▃▅▆▇███████████████████▆▆▅▄▄▄▄▄▄▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁ ▄
  322 ns           Histogram: frequency by time          450 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.107 μs …  3.418 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.118 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.134 μs ± 105.157 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                  ▁                       ▁
  █▄▁▇▇▇▅▄▄▆▄▅▅▁▅▄▄▅▃▄▄▁▃▃▁▁▃▁▃▁▃▄▁▁█▅▁▁▁▄▄▄▁▅▅▄▃▃▄▃▁▁▁▁▁▄▆ █
  1.11 μs      Histogram: log(frequency) by time      1.87 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.277 μs …  5.129 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.288 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.301 μs ± 107.008 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▁▂▁▂▂▂▁▂▁▁▂▁▂▂▂▁▁▁▁▁▁▁▁▂▂▂▂▂ ▂
  1.28 μs         Histogram: frequency by time        2.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.468 μs …  5.388 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.556 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.582 μs ± 153.495 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▆                                                     
  ▂▃▅██▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▁▂▂▂▁▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▁▂▂▂▂▂▂▂ ▃
  2.47 μs         Histogram: frequency by time        3.42 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.262 μs … 13.575 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.298 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.382 μs ± 404.906 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▁                                              ▁▁      ▂
  ██▅██▆▅▇█▇▅▇▆▇▇▆▅▆▄▄▅▄▄▄▄▅▁▁▃▄▁▄▃▃▃▁▁▁▁▁▃▃▁▁▁▃▄▆████▇▇▆▄▄ █
  6.26 μs      Histogram: log(frequency) by time      8.07 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (min … max):  211.026 ns … 278.987 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     215.404 ns                GC (median):    0.00%
 Time  (mean ± σ):   217.590 ns ±   5.913 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▄▆██▆▅▄▃▁▁                ▁▂▃▃▄▃▂▁▁                    ▂
  ▃▄▅██████████████▇▆▆▆▅▃▆▅▄▅▄▄▆███████████▆▅▆▅▄▃▃▂▄▃▂▆▄▆▆▅▆▇ █
  211 ns        Histogram: log(frequency) by time        240 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 183 evaluations per sample.
 Range (min … max):  570.033 ns … 9.887 μs   GC (min … max): 0.00% … 91.45%
 Time  (median):     593.519 ns               GC (median):    0.00%
 Time  (mean ± σ):   600.403 ns ± 98.611 ns   GC (mean ± σ):  0.15% ±  0.91%

           ▂▆▇█▅▁                                             
  ▁▁▁▁▂▂▃▅███████▇▃▂▂▂▁▁▁▁▁▁▁▁▁▂▂▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  570 ns          Histogram: frequency by time          679 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.916 μs … 32.841 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.987 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.164 μs ±  1.034 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▆██▇▆▆▆▅▅▅▅▆▅█▆▄▄▄▃▄▃▄▄▁▁▁▁▁▁▃▁▃▄▁▃▄▄▄▃▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆▇ █
  13.9 μs      Histogram: log(frequency) by time      21.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
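
Note that the reinterpreted array is a zero-copy view of u_aos_plain with element type Vec5{T}; creating it does not allocate a new buffer. A quick check, purely illustrative:

@show eltype(u_aos_r) === Vec5{Float64} # true
@show size(u_aos_r)                     # (101,) - one Vec5 per grid node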

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference compared to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (min … max):  208.906 ns … 264.385 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.655 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.436 ns ±   5.109 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▃█▇                                                  
  ▂▂▂▂▂▂▃▅████▅▄▃▂▂▂▂▂▂▂▂▂▁▁▂▂▁▂▁▁▁▁▁▂▁▂▁▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations per sample.
 Range (min … max):  553.123 ns … 991.321 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     579.053 ns                GC (median):    0.00%
 Time  (mean ± σ):   583.387 ns ±  20.446 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▁▃▃▄▅▅█▇█▇▅▃                                      
  ▁▁▂▂▃▃▄▆▆███████████████▆▆▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▁▁▁▁ ▄
  553 ns           Histogram: frequency by time          637 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.906 μs … 49.242 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.976 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.142 μs ±  1.100 μs   GC (mean ± σ):  0.00% ± 0.00%

  █   ▁▁                                                     ▁
  █▃▆██▇▅▆▅▆▅▅▆▅▅▅▆▄▄▄▄▁▃▄▃▁▁▁▃▁▁▁▁▄▁▁▃▄▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▄▆▇ █
  13.9 μs      Histogram: log(frequency) by time      21.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 473 evaluations per sample.
 Range (min … max):  225.985 ns … 1.140 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     229.628 ns               GC (median):    0.00%
 Time  (mean ± σ):   236.182 ns ± 32.765 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆▆▁   ▄▃▁                                                   ▂
  ███▄████▇▆▆▅▄▅▆▄▅▅▄▁▅▄▄▄▅▃▃▄▃▃▁▁▃▄▃▆▅▅▆▆▆▆▅▄▅▅▃▅▆▆▅▅▆▆▆▅▆▆ █
  226 ns        Histogram: log(frequency) by time       373 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.912 μs … 111.589 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.062 μs                GC (median):    0.00%
 Time  (mean ± σ):   49.550 μs ±   2.264 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                 ▁▂▁    ▁
  ██▇▇█▇▆▆▄▅▄▆▆▇▅▅▃▃▃▄▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▄▁▃▃▁▁▁▃▁▁▁▁▆▇██████▆ █
  48.9 μs       Histogram: log(frequency) by time      57.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.432 μs … 11.144 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.472 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.536 μs ± 298.732 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁                                                ▁▁   ▁
  ██▇██▇▆▆▇▇█▇▇▇▇▆▅▅▅▅▅▄▄▁▄▁▃▁▁▃▁▁▃▃▄▁▁▁▁▃▁▁▁▃▁▁▁▃▃▃▇████▇ █
  5.43 μs      Histogram: log(frequency) by time      6.82 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
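
For reference, the StructArray stores each field of Vec5 in its own contiguous vector, which is exactly the SoA layout. This can be inspected as follows, using StructArrays.components as an illustration:

fields = StructArrays.components(u_soa) # named tuple (x1 = ..., ..., x5 = ...)
@show typeof(fields.x1)                 # Vector{Float64}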

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.16
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.90 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`