Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they run on virtual machines in the cloud to generate the documentation automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (minmax):  26.150 ns61.513 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     26.895 ns               GC (median):    0.00%
 Time  (mean ± σ):   27.419 ns ±  2.292 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅███▅▄▃▂▂▁                          ▁                    ▃
  ████████████▆▆▅▅▅▅▅▁▅▃▄▄▄▅▅▃▁▄▅▅▃▆██████▇▆▆▆▆▅▅▆▅▅▆▅▆▅▅▅ █
  26.1 ns      Histogram: log(frequency) by time      38.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware etc., this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 580 evaluations per sample.
 Range (minmax):  201.793 ns382.684 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.688 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.518 ns ±   7.292 ns   GC (mean ± σ):  0.00% ± 0.00%

              ▂▄▅█▆█▅▂                                         
  ▁▁▁▁▂▂▂▃▄▅▇██████████▆▆▄▄▃▃▂▂▂▁▂▁▁▁▁▁▁▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▁▂▁▁▁▁ ▃
  202 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.91-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (minmax):  387.064 ns604.599 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     390.089 ns                GC (median):    0.00%
 Time  (mean ± σ):   394.454 ns ±  14.972 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▇█▆▄▂                               ▂▃▃▂▁                   ▂
  ██████▇▅▇█▇▇▆▃▁▄▁▁▃▃▅▃▃▃▅▃▁▃▁▁▃▁▁▄▁███████▆▆▆▆▆▆▆▆▆▆▆▆▅▄▄▄▃ █
  387 ns        Histogram: log(frequency) by time        450 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise - it is again much slower, as in periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (minmax):  4.382 μs 10.040 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.467 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.516 μs ± 270.742 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▁▁                                              ▁▁▁  ▂
  ▆██████▇▆▃▅▄▅▆▅▆▄▄▃▆▇▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆█████ █
  4.38 μs      Histogram: log(frequency) by time      5.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  9.527 μs30.207 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.558 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.713 μs ±  1.107 μs   GC (mean ± σ):  0.00% ± 0.00%

              ▁                                             ▁
  ▅▁▁▆▆▃▁▁▁▁▁█▄▄▄▃▁▃▁▁▁▁▁▁▇▄▁▁▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▅▆ █
  9.53 μs      Histogram: log(frequency) by time     17.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.91-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. At first, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

At first, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (minmax):  384.468 ns544.222 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     388.389 ns                GC (median):    0.00%
 Time  (mean ± σ):   391.892 ns ±  12.062 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▆█                                                    
  ▂▃████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▂▁▂▁▂▁▁▁▂▁▁▁▁▁▂▁▁▁▂▂▂▃▃▃▃▃▂▂▂▂▂ ▃
  384 ns           Histogram: frequency by time          432 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 14 evaluations per sample.
 Range (minmax):  985.429 ns 2.320 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     996.143 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.006 μs ± 72.205 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                          ▁
  █▆▁▃▅▅▃▁▁▁▁▄▃▃▅▄▄▃▃▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇ █
  985 ns        Histogram: log(frequency) by time      1.55 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to sparse matrix representations. It will not come as a surprise that these are again much (around an order of magnitude) slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.190 μs 22.116 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.317 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.494 μs ± 977.479 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█             ▂▁                                         ▁
  ██▇▇▆▅▇▅▅▅▆▅▅███▇▆▅▆▇▅▅▅▄▄▅▅▅▅▅▅▄▄▅▄▄▅▅▅▄▃▃▄▁▃▄▁▄▅▄▅▄▃▄▅ █
  5.19 μs      Histogram: log(frequency) by time        10 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.907 μs 18.567 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.983 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.076 μs ± 478.179 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▇█▄▄▁                                            ▁▁▁     ▂
  ██████▇▅▆▃▁▁▃▆▆▅▄▄▃▄▁▃▁▃▃▁▁▃▁▁▃▃▁▁▁▁▁▁▁▃▁▃▁▁▁▁▃▃▅▆████▇▆▆ █
  6.91 μs      Histogram: log(frequency) by time      8.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  116.228 μs310.553 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     144.220 μs                GC (median):    0.00%
 Time  (mean ± σ):   145.749 μs ±   7.966 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▂▃                           ▆█▄▃▂▁▁▂▄▅▄▃▂▁▁               ▂
  ███▅▄▃▁▃▃▄▁▇▆▆▅▄▃▅▄▁▁▁▁▁▁▁▁▃█████████████████████▇▇▇▆▇▇▆▆▇ █
  116 μs        Histogram: log(frequency) by time        171 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.11.0
  [9f78cca6] SummationByPartsOperators v0.5.91-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower and the dense matrix representation is far off.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (minmax):  45.465 ns91.476 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.096 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.676 ns ±  2.359 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▃▆█                                                 
  ▂▂▂▃▆█████▆▄▄▃▃▃▂▂▂▂▂▂▂▁▂▁▂▂▂▂▁▁▂▁▂▁▁▂▂▁▂▂▂▂▂▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  45.5 ns         Histogram: frequency by time          57 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (minmax):  326.750 ns631.404 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     388.184 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.474 ns ±  22.191 ns   GC (mean ± σ):  0.00% ± 0.00%

                       ▃▅▆▅▆▆█▇▇▇▆▃ ▁                         
  ▁▁▁▁▂▁▁▁▁▁▂▃▄▄▄▄▄▅▆▆████████████████▇▆▆▆▇▅▄▃▃▄▃▃▃▃▃▃▂▂▂▂▁▁▂ ▄
  327 ns           Histogram: frequency by time          453 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.110 μs  4.885 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.123 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.141 μs ± 130.895 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▆▁▄█▁▃▁▃▃▁▁▄▅▄▃▁▄▁▁▁▁▁▁▁▄▁▁▁▃▁▁▁▁▃▁▃▁▁▁▁▁▁▃▁▁▃▃▁▆▃▃▄▅▇▇▇ █
  1.11 μs      Histogram: log(frequency) by time      1.96 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (minmax):  1.275 μs  3.707 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.284 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.299 μs ± 103.395 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▄▄█▆▁▁▁▃▁▃▄▅▃▄▄▄▄▄▃▁▃▄▁▃▄▁▁▄▄▃▃▃▄▄▁▄▃▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▇ █
  1.28 μs      Histogram: log(frequency) by time      2.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (minmax):  2.511 μs  8.459 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.647 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.694 μs ± 196.057 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▆█▅▂▂                                                  
  ▂▂▄██████▆▅▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  2.51 μs         Histogram: frequency by time        3.61 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (minmax):  6.262 μs 13.367 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.314 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.383 μs ± 357.744 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                    ▁▁  ▁
  ██▅██▅▃▄▁▃▃▃▅▆▇▅▇▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▁▃▄▇████ █
  6.26 μs      Histogram: log(frequency) by time      8.01 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (minmax):  208.980 ns288.633 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.196 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.073 ns ±   5.245 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂█▅                                                   
  ▂▂▂▂▂▃▅███▇▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▂▂▁▁▂▁▂▁▂▁▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (minmax):  533.593 ns 9.877 μs   GC (min … max): 0.00% … 92.05%
 Time  (median):     555.116 ns               GC (median):    0.00%
 Time  (mean ± σ):   562.941 ns ± 97.151 ns   GC (mean ± σ):  0.16% ±  0.92%

           ▂▃▆█▇▃▁                                            
  ▂▁▂▂▂▃▄▅▇███████▇▇▆▅▄▄▃▄▄▃▃▃▂▂▂▂▂▂▃▃▃▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  534 ns          Histogram: frequency by time          623 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.907 μs49.783 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.986 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.160 μs ±  1.201 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▃▃▇▅▃▁▃▃▃▃▁▅▅▄▅▄▄▃▃▁▃▄▁▄▁▃▃▅▅▃▃▃▁▃▅▃▆▅▄▅▄▄▁▁▁▁▁▁▁▁▁▁▄▇▇ █
  13.9 μs      Histogram: log(frequency) by time      21.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (minmax):  207.807 ns297.047 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.289 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.104 ns ±   5.382 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▄█▆                                                
  ▂▂▂▂▂▂▂▃▄█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▁▂▂▂▂▂▃▄▃▃▃▃▂▂▂▂▂▂ ▃
  208 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 193 evaluations per sample.
 Range (minmax):  505.508 ns961.497 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     525.808 ns                GC (median):    0.00%
 Time  (mean ± σ):   531.023 ns ±  23.215 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▁▁▄▅▆█▇▅▃                                             
  ▁▁▂▃▄▇▇██████████▆▅▄▃▂▂▂▁▁▁▁▁▁▂▁▂▂▂▂▃▃▃▂▃▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  506 ns           Histogram: frequency by time          596 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  13.796 μs54.442 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.876 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.150 μs ±  1.512 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                  ▁▁                      ▁
  █▁▄▇▆█▇▄▁▁▁▁▃▄▅▅▄▅▃▃▃▁▄▃▃▃▅▅▄▁▄▄▃██▅▃▄▄▃▃▁▃▁▁▄▁▁▃▁▁▁▄▆▆▇ █
  13.8 μs      Histogram: log(frequency) by time      21.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly bad in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 482 evaluations per sample.
 Range (minmax):  223.822 ns331.929 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.272 ns                GC (median):    0.00%
 Time  (mean ± σ):   229.333 ns ±   6.960 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▅██▇                                                
  ▂▃▄▇███████▆▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▃▃▃▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  224 ns           Histogram: frequency by time          249 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (minmax):  48.871 μs98.265 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.493 μs               GC (median):    0.00%
 Time  (mean ± σ):   49.963 μs ±  2.430 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▅▆▁▇  ▁                                           ▁▂▁▁   ▂
  ██████▇█▆▁▁▄▅▆▇▆▅▅▁▁▃▃▁▁▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▃▃▃▄▆▆██████▇ █
  48.9 μs      Histogram: log(frequency) by time      57.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (minmax):  5.392 μs 11.483 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.440 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.658 μs ± 427.186 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇▂  ▁          ▅▅▅▃                                        ▂
  ██▄██▄█▅▃▅▆▅▁▄████▇▅▆▄▄▁▄▁▄▅▃▁▁▁▄▃▁▃▆██▇▇▇▆▆▅▄▄▆▄▅▄▃▆▆▇▇▇ █
  5.39 μs      Histogram: log(frequency) by time      7.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.17
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.91-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`