Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run on cloud virtual machines as part of the automated documentation build, so absolute timings can vary.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Benchmark the in-place application du = D * u and print the results
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # short pause before benchmarking
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  26.290 ns … 57.908 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     26.955 ns               GC (median):    0.00%
 Time  (mean ± σ):   27.234 ns ±  1.410 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂███                                                    
  ▃████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▂▂▂ ▃
  26.3 ns         Histogram: frequency by time        34.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
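
To see why the optimized implementation is so fast, note that a periodic operator can be applied as a short stencil loop instead of a general matrix-vector product. The following hand-written kernel is a minimal sketch for this particular operator, assuming the classical second-order central-difference coefficients (the actual implementation in SummationByPartsOperators.jl is more general):

# Central difference with periodic wrap-around: du[i] = (u[i+1] - u[i-1]) / (2h)
function stencil_mul!(du, u, h)
  N = length(u)
  inv2h = inv(2 * h)
  @inbounds for i in 1:N
    im1 = i == 1 ? N : i - 1
    ip1 = i == N ? 1 : i + 1
    du[i] = (u[ip1] - u[im1]) * inv2h
  end
  return du
end

# For this simple operator, the stencil should reproduce D_SBP up to roundoff
stencil_mul!(du, u, step(x)) ≈ D_SBP * u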

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be roughly an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 606 evaluations per sample.
 Range (min … max):  197.498 ns … 346.210 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.840 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.249 ns ±   7.130 ns   GC (mean ± σ):  0.00% ± 0.00%

              ▂▃▄▆▆▇█▆▇▇█▄▃▃▁                                
  ▁▁▁▁▁▁▂▃▄▅▆▇████████████████▇▆▅▅▅▅▄▄▄▄▃▄▄▄▄▃▃▃▃▃▂▂▂▂▂▁▁▁▁▁ ▄
  197 ns           Histogram: frequency by time          233 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

# Benchmark the in-place application du = D * u and print the results
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # short pause before benchmarking
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  380.665 ns … 653.734 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     385.453 ns                GC (median):    0.00%
 Time  (mean ± σ):   388.791 ns ±  15.416 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▁▁█▃                                                      
  ▃██████▆▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▂▁▁▂▁▁▂▁▁▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  381 ns           Histogram: frequency by time          434 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.506 μs …   8.480 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.609 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.649 μs ± 226.271 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▅▇█▂                                              ▁▁ ▁ ▂
  ███████▇▇▅▁▁▄▅▅▆▆▅▅▃▃▁▃▁▁▁▁▁▃▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▄▇█████ █
  4.51 μs      Histogram: log(frequency) by time      5.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  9.618 μs … 32.551 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.648 μs               GC (median):    0.00%
 Time  (mean ± σ):   9.780 μs ±  1.020 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅▅▅▅▇▃▄▁▃▃▁▃█▄▄▄▄▃▁▁▃▁▃▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▅ █
  9.62 μs      Histogram: log(frequency) by time     17.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
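
For context, the storage of the alternative representations can be inspected directly. A small sketch using nnz from SparseArrays and bandwidths from BandedMatrices.jl (the exact numbers depend on N and the accuracy order):

@show nnz(D_sparse)        # number of entries stored in the sparse format
@show bandwidths(D_banded) # (lower, upper) bandwidths of the banded format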

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.10.2
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

# Benchmark the in-place application du = D * u and print the results
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # short pause before benchmarking
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  383.724 ns … 524.429 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.522 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.853 ns ±  11.210 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁▅▇█▇▆▄▃▁▁                                      ▂▃▃▃▂▂     ▂
  ▆██████████▇▆▆▆▇▇▇▇▆▅▁▄▁▃▃▃▁▁▃▁▁▁▁▁▁▃▁▃▁▁▁▁▁▃▁▁▆█████████▇▇ █
  384 ns        Histogram: log(frequency) by time        428 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 15 evaluations per sample.
 Range (min … max):  982.533 ns …  2.102 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     995.200 ns               GC (median):    0.00%
 Time  (mean ± σ):     1.004 μs ± 65.076 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃█                                                          ▁
  ██▃▁▄█▇▁▁▄▁▁▃▁▄▄▅▄▄▄▁▃▁▁▁▁▁▁▃▃▁▁▃▁▁▃▁▃▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆█ █
  983 ns        Histogram: log(frequency) by time      1.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.922 μs … 10.201 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.008 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.057 μs ± 260.428 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▇▄                                              ▁▁▂▁  ▂
  ▇███████▇▆▁▅▆▅▅▅▆▆▄▅▁▃▁▁▁▁▃▁▁▁▃▄▄▅▄▃▁▁▁▁▄▃▁▁▁▃▁▁▁▁▄▇█████ █
  4.92 μs      Histogram: log(frequency) by time      6.14 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.899 μs … 15.567 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.949 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.021 μs ± 410.031 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                   ▁    ▁
  ██▄▇█▆▁▃▁▁▁▃▄▅▆▅▅▄▃▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▅█████▆ █
  6.9 μs       Histogram: log(frequency) by time      8.53 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  114.775 μs … 280.655 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     115.767 μs                GC (median):    0.00%
 Time  (mean ± σ):   117.878 μs ±   5.456 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▆█▃▂                   ▃▃▂▁                                 ▁
  ████████▆▆▅▅▅▆▆▇▆▅▅▇▇▆██████▇█████▇▆▅▆▄▆▅▄▆▆▅▆▅▅▄▅▄▃▅▅▄▄▆▄▅ █
  115 μs        Histogram: log(frequency) by time        137 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
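
To quantify how wasteful the dense representation is, one can compare approximate memory footprints using Base.summarysize; a small sketch (the exact byte counts depend on N):

@show Base.summarysize(Di_sparse)
@show Base.summarysize(Di_banded)
@show Base.summarysize(Di_full) # grows like O(N^2), unlike the other two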

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.10.2
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, in particular of arrays of StaticVectors from StaticArrays.jl and of StructArrays.jl.
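
To make the distinction concrete, here is a tiny standalone illustration using StructArrays.jl (independent of the benchmark code below):

using StructArrays

# AoS: each element stores its components contiguously, here as (re, im) pairs
v_aos = [1.0 + 3.0im, 2.0 + 4.0im]

# SoA: one plain array per component, wrapped to act like a vector of structs
v_soa = StructArray{ComplexF64}((re = [1.0, 2.0], im = [3.0, 4.0]))

v_aos == v_soa # both represent the same vector of complex numbers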

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176
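
Before benchmarking, a quick sanity check that is not part of the original benchmark code: a consistent first-derivative operator maps constants to zero, so all row sums of the dense matrix vanish up to roundoff.

maximum(abs, D_full * ones(size(D_full, 2))) # ≈ 0 up to floating-point error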

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far off since a dense matrix-vector product costs O(N^2) operations instead of the O(N) work of a narrow stencil.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  47.619 ns … 77.818 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.816 ns               GC (median):    0.00%
 Time  (mean ± σ):   49.355 ns ±  2.079 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▅█▇▂▁                                                  
  ▂▃▇██████▇▅▄▄▃▃▃▃▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▃▃▂▂▂▂▂▂▂▂▂ ▃
  47.6 ns         Histogram: frequency by time        57.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 231 evaluations per sample.
 Range (min … max):  318.645 ns … 536.329 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     360.026 ns                GC (median):    0.00%
 Time  (mean ± σ):   361.728 ns ±  17.470 ns   GC (mean ± σ):  0.00% ± 0.00%

                   ▁▃▄▆▅▅▆▆█▇▇▇▅▄▂                            
  ▁▁▁▂▃▃▃▃▃▃▃▃▃▄▅▆▇████████████████▇▆▅▅▅▄▄▄▄▄▄▃▃▃▃▂▃▂▂▂▂▂▁▂▁▁ ▄
  319 ns           Histogram: frequency by time          412 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.124 μs …   3.178 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.140 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.153 μs ± 101.830 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇           ▁                                             ▁
  █▅▃▃▅▆▃▇▄▁▁▃█▅▃▄▄▄▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▃▁▄▃▃▃▁▁▁▁▁▁▁▁▁▃▅▆ █
  1.12 μs      Histogram: log(frequency) by time      1.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.270 μs …  3.198 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.281 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.294 μs ± 96.934 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▄▄▃▄▇▁▁▁▄▁▁▁▃▅▄▄▄▃▄▃▁▁▁▁▃▃▁▁▃▁▁▁▁▁▃▃▄▃▃▃▄▃▁▁▁▁▁▁▁▁▁▁▃▆▇ █
  1.27 μs      Histogram: log(frequency) by time        2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.488 μs …  4.908 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.578 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.602 μs ± 149.066 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▇█                                                    
  ▂▃▅███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  2.49 μs         Histogram: frequency by time        3.42 μs <

 Memory estimate: 240 bytes, allocs estimate: 5.
D_full
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.262 μs … 11.495 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.306 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.366 μs ± 319.111 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                    ▁▁  ▁
  ██▄██▃▁▃▃▃▁▃▅▅▆▄▄▅▆▃▁▃▄▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▃▁▄▅▇████ █
  6.26 μs      Histogram: log(frequency) by time      7.87 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. Here, reinterpret(reshape, Vec5{T}, u_aos_plain) views the 5×N matrix as a vector of Vec5 without copying any data. This is much more efficient for the implementation in SummationByPartsOperators.jl. On Julia v1.6, it is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 530 evaluations per sample.
 Range (min … max):  213.664 ns … 292.396 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     215.913 ns                GC (median):    0.00%
 Time  (mean ± σ):   217.941 ns ±   5.687 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▄▇█▆▅▅▃                      ▂▄▄▃▂▂▁                      ▂
  ▅▇██████████▇▇▄▅▁▁▄▄▄▁▃▄▁▁▅▁▃▃▇████████▇▆▆▆▄▆▄▃▁▃▃▅▃▃▃▄▃▄▄▅ █
  214 ns        Histogram: log(frequency) by time        241 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 191 evaluations per sample.
 Range (min … max):  521.764 ns …  10.103 μs   GC (min … max): 0.00% … 92.75%
 Time  (median):     544.052 ns                GC (median):    0.00%
 Time  (mean ± σ):   550.368 ns ± 100.118 ns   GC (mean ± σ):  0.17% ±  0.93%

        ▂▃▇████▆▆▄▃▁                                          
  ▁▁▂▂▄▇█████████████▆▅▄▄▃▂▂▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  522 ns           Histogram: frequency by time          618 ns <

 Memory estimate: 32 bytes, allocs estimate: 1.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.866 μs … 33.583 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.936 μs                GC (median):    0.00%
 Time  (mean ± σ):   14.079 μs ± 983.274 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▄▁▁▇▆▇▇▆▄▁▃▁▃▅▅▄▄▁▁▃▁▁▃▁▃▁▃▃▃▁▁▄▁▁▃▃▄▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▃▃▄▇▇ █
  13.9 μs       Histogram: log(frequency) by time      21.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. In this case, there is no significant difference from the previous version.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 525 evaluations per sample.
 Range (min … max):  216.672 ns …  1.227 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     221.956 ns               GC (median):    0.00%
 Time  (mean ± σ):   224.488 ns ± 17.100 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▆█                                              
  ▂▂▂▂▃▄▆█████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▃
  217 ns          Histogram: frequency by time          248 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 188 evaluations per sample.
 Range (min … max):  542.021 ns … 797.181 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     560.995 ns                GC (median):    0.00%
 Time  (mean ± σ):   565.542 ns ±  17.203 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▅▇▆██▆▃▂                                           
  ▁▁▁▂▂▃▅▇██████████▇▅▄▄▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▁ ▃
  542 ns           Histogram: frequency by time          619 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  13.896 μs … 47.068 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     13.976 μs               GC (median):    0.00%
 Time  (mean ± σ):   14.118 μs ±  1.068 μs   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▅▃▃█▇▁▁▁▃▃▁▃▁▄▅▅▄▄▁▃▁▁▃▁▁▁▁▄▁▃▃▄▃▁▁▁▁▄▁▁▁▁▃▄▁▁▁▁▁▁▁▁▁▄▆▇ █
  13.9 μs      Histogram: log(frequency) by time      21.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 482 evaluations per sample.
 Range (min … max):  223.093 ns … 359.012 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     226.544 ns                GC (median):    0.00%
 Time  (mean ± σ):   228.507 ns ±   6.592 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃▆▇█▅▂                                                   
  ▂▃▄████████▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▁▂▂▂▂▃▃▃▄▄▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  223 ns           Histogram: frequency by time          249 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  48.771 μs … 102.522 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.992 μs                GC (median):    0.00%
 Time  (mean ± σ):   49.453 μs ±   2.176 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                  ▁▁▁   ▁
  ███▆▆█▇▄▃▁▁▄▃▅▅▆▇▄▁▃▄▁▃▁▃▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆██████ █
  48.8 μs       Histogram: log(frequency) by time      56.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.454 μs … 12.612 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.492 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.547 μs ± 292.430 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                    ▁   ▁
  ██▃▇█▆▃▃▃▁▃▃▅▆▆▄▅▃▁▃▄▁▃▁▁▆▇▃▁▃▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▄▅▇████ █
  5.45 μs      Histogram: log(frequency) by time      6.81 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.10.10
Commit 95f30e51f41 (2025-06-27 09:51 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, znver3)
Threads: 2 default, 0 interactive, 1 GC (on 4 virtual cores)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.7.2
  [9f78cca6] SummationByPartsOperators v0.5.89-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`