Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run automatically on virtual machines in the cloud when the documentation is generated, so the timings can be noisy.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
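  # brief pause, presumably to let printed output flush before benchmarking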
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  29.028 ns … 51.907 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     29.672 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   29.964 ns ±  1.460 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▄▇██▄▃▂▁                                          ▁▁▁▁  ▂
  ▇███████████▆▅▅▅▅▆▅▃▅▁▁▁▃▃▁▃▁▁▁▁▃▃▁▁▃▁▃▁▁▃▁▁▁▁▁▁▄▄▆█████ █
  29 ns        Histogram: log(frequency) by time      37.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 595 evaluations per sample.
 Range (min … max):  199.783 ns … 318.696 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     202.462 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   204.660 ns ±   6.226 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █▇▄▂                                                       
  ▁▂█████▇▆▅▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▂▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  200 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
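
The gap is plausible because a periodic operator applies the same interior stencil in every row, which an optimized implementation can hard-code, while a generic sparse matrix-vector product has to load column indices and values from memory. The following is a minimal sketch of such a stencil loop for the second-order periodic case, written just for this comparison; it is not the actual implementation of SummationByPartsOperators.jl.

# Minimal sketch of a hard-coded periodic central-difference stencil loop.
# Illustration only; not the actual implementation of SummationByPartsOperators.jl.
function stencil_mul_periodic!(du, u, h)
  N = length(u)
  factor = inv(2 * h)
  @inbounds for i in 1:N
    im1 = i == 1 ? N : i - 1  # periodic wrap-around on the left
    ip1 = i == N ? 1 : i + 1  # periodic wrap-around on the right
    du[i] = factor * (u[ip1] - u[im1])
  end
  return du
end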

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.86 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
 Range (min … max):  415.045 ns … 623.121 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     418.719 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   422.327 ns ±  12.027 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▆█                                                    
  ▂▅████▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▁▁▁▁▂▁▁▁▂▁▁▁▂▁▂▂▂▂▃▃▃▃▃▃▂▂▂▂ ▃
  415 ns           Histogram: frequency by time          460 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. No surprise: it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.797 μs …  11.969 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     4.843 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.906 μs ± 364.727 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃█                                         ▁▁▁           ▂
  ███▅█▇▁▁▁▁▁▄▄▆▆▃▄▁▃▁▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▆████▇▆▆▅▅▆▅▄▃ █
  4.8 μs       Histogram: log(frequency) by time      6.15 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.674 μs …  13.527 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.704 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.778 μs ± 406.575 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █   ▁                                        ▁▁            ▁
  █▁█▅▁▁▃▁▁▅▅▅▄▃▅▄▁▁▁▃▃▄▁▁▄▁▁▃▁▃▁▁▃▁▁▁▁▁▁▁▁▄▇███▇▆▅▅▆▆▄▆▆▆ █
  6.67 μs      Histogram: log(frequency) by time      8.55 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
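
One contributing factor is that the banded format has to store a bandwidth wide enough for the boundary closures of the SBP operator, which are wider than the interior stencil. If you want to inspect this, BandedMatrices.jl provides the bandwidths function (the exact values depend on the chosen operator):

# Inspect the stored lower and upper bandwidths (l, u) of the banded
# representation; the boundary closures force a larger bandwidth than
# the interior stencil alone would need.
bandwidths(D_banded)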

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.86 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (min … max):  388.644 ns … 694.564 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     392.114 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   395.597 ns ±  12.262 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ██                                                        
  ▂▃███▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂ ▃
  389 ns           Histogram: frequency by time          434 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.027 μs …  3.392 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.037 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.047 μs ± 88.165 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▁
  █▄▄▅▇▇▁▁▁▃▃▃▃▄▄▅▅▄▃▄▃▄▄▁▃▃▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄ █
  1.03 μs      Histogram: log(frequency) by time     1.76 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare these results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by roughly an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.261 μs …  13.527 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     5.423 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.483 μs ± 372.313 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▂▅▇█                                           ▁▁▁     ▂
  ▅▆████▆▆█▇▃▃▄▄▅▆▆▄▁▄▄▁▁▁▁▁▁▁▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆▇████▇▇▇▆ █
  5.26 μs      Histogram: log(frequency) by time      6.82 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.212 μs …  15.543 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     6.242 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   6.382 μs ± 420.463 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                  ▄▅                        ▁             ▂
  █▃▅▄▇▁▃▁▃▁▃▆▅▅▄▄▁▆██▁▄▃▅▆▃▁▁▁▃▄▁▃▁▅▄▁▃▁▁▁▁▄▇███▆▆▅▆▅▅▆▅▆▅ █
  6.21 μs      Histogram: log(frequency) by time      8.08 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  133.969 μs … 360.111 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     150.982 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   154.345 μs ±   9.814 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                      ▂██▅▄▃▃▃▃▃▅▅▄▃▃▂▂▂▁▂▁▁▁  ▁              ▂
  ▅▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁███████████████████████████▇▇▆▆▅▇▆▅▆▇▅▅ █
  134 μs        Histogram: log(frequency) by time        179 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
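
The magnitude of this slowdown is easy to estimate: the dense 1000 × 1000 matrix alone occupies roughly 7.6 MiB (10^6 entries of 8 bytes each), far more than typical L1/L2 caches, so every matrix-vector product has to stream the whole matrix from main memory.

# Back-of-the-envelope footprint of the dense matrix:
# 1000^2 Float64 entries * 8 bytes ≈ 7.6 MiB.
sizeof(Di_full) / 2^20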

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.86 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of StaticVectors from StaticArrays.jl and of StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

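# A fixed-size vector with five named components; subtyping FieldVector
# from StaticArrays.jl makes it behave like an SVector of length 5.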
struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is roughly an order of magnitude slower, and the dense matrix representation is slower still by a wide margin.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  44.158 ns … 77.384 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     45.950 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   46.461 ns ±  2.234 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▅▇██▄▃▁                          ▁▂▂▂▁              ▂
  ▄▁▁▄▇████████▇▆▇▇▅▅▅▁▄▄▄▃▃▅▅▃▃▄▄▃▁▄▁▅███████▇▆▇▅▅▄▅▅▅▄▃▄ █
  44.2 ns      Histogram: log(frequency) by time      56.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations per sample.
 Range (min … max):  334.059 ns … 601.434 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     341.588 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   345.141 ns ±  12.354 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▃█▇▅                                                   
  ▁▁▁▁▂▆█████▇▅▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▂
  334 ns           Histogram: frequency by time          382 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.714 μs …   4.643 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.727 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.745 μs ± 117.898 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▂
  █▇▅▁▄▆▃▁▃▁▁▁▁▄▃▆▅▁▁▁▄▇▃▁▁▁▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▅██ █
  1.71 μs      Histogram: log(frequency) by time      2.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.288 μs …  3.900 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.295 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.307 μs ± 95.445 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▃▄▄█▆▁▃▁▃▁▁▁▃▄▄▄▄▁▄▃▁▁▁▁▁▃▁▃▁▁▃▃▁▁▃▁▃▄▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▄▇ █
  1.29 μs      Histogram: log(frequency) by time     2.01 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.331 μs …   5.462 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.408 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.431 μs ± 147.292 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▁▄▇█▂                                                   ▂
  ▆██████▇▅▄▃▁▄▄▄▄▆▅▅▅▇▆▄▁▃▁▁▁▁▄▃▁▁▁▁▃▁▁▁▁▁▁▄▁▃▁▁▁▃▃▄▅████ █
  2.33 μs      Histogram: log(frequency) by time      3.22 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.756 μs …  16.744 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     8.810 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   8.904 μs ± 529.227 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                    ▁▁   ▁
  █▄▃▆▆▃▃▁▁▃▃▁▄▆▅▄▄▁▃▁▁▁▁▁▃▁▁▁▁▁▃▄▃▁▁▁▁▁▃▃▁▃▃▁▁▁▁▁▁▁▅████▆ █
  8.76 μs      Histogram: log(frequency) by time      11.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
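
To see what this reinterpret call does, here is a tiny standalone sketch with sizes chosen only for illustration: a 5 × 3 matrix of scalars is viewed, without copying, as a vector of three Vec5 elements, one per column.

# Each column of the 5×3 scalar matrix becomes one Vec5 element.
a = randn(5, 3)
b = reinterpret(reshape, Vec5{Float64}, a)  # 3-element vector of Vec5
b[2] == Vec5(a[:, 2]...)  # true: the second column is the second element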

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (min … max):  204.931 ns … 260.044 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     207.982 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   209.649 ns ±   4.734 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

         ▄█▇                                               
  ▂▂▂▂▃▄▇█████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▂▁▁▁▂▂▂▂▂▃▃▄▄▃▃▃▂▂▂▂▂▂ ▃
  205 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations per sample.
 Range (min … max):  545.775 ns … 814.246 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     560.829 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   565.610 ns ±  15.426 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▁▄██▄▂                                              
  ▂▁▂▂▂▂▃▄▆██████▇▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▃▃▃▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  546 ns           Histogram: frequency by time          615 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.088 μs …  21.342 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.146 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.280 μs ± 831.015 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █                        ▁▁                                ▁
  █▁▁▃▅▆▅▄▁▃▃▃▁▁▃▄▁▁▁▄▁▁▆███▅▅▅▅▄▃▄▄▄▃▄▄▁▃▁▃▄▃▃▄▃▄▃▃▁▁▁▃▃▃ █
  7.09 μs      Histogram: log(frequency) by time      11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (min … max):  204.752 ns … 321.450 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     207.450 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   211.231 ns ±  11.323 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃█        ▁▄▃                                            ▂
  ▅█████▇▄▁▁▄▅▅█████▇▇▅▆▄▅▅▇▆▆▇▆▆▅▅▆▆▅▄▃▅▆▅▅▆▅▅▄▆▆▇▇███▇▇▇▆▇▇ █
  205 ns        Histogram: log(frequency) by time        260 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  535.286 ns …  1.009 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     543.656 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   549.833 ns ± 25.666 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▃█                                                       
  ▂▃▅███▆▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  535 ns          Histogram: frequency by time          631 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.281 μs …  17.302 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     7.339 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   7.428 μs ± 547.618 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                  ▁      ▁
  ██▁▄▅▁▁▃▃▃▁▄▅▆▄▅▄▃▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▇███▇▇▄ █
  7.28 μs      Histogram: log(frequency) by time      9.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 464 evaluations per sample.
 Range (min … max):  227.407 ns … 317.683 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     230.062 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   231.972 ns ±   5.987 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▆█▇                                                   
  ▂▃▅█████▆▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▂▂▂▂▂▃▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  227 ns           Histogram: frequency by time          252 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  217.045 μs …   6.214 ms  ┊ GC (min … max):  0.00% … 95.93%
 Time  (median):     224.899 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   256.165 μs ± 376.365 μs  ┊ GC (mean ± σ):  10.65% ±  6.92%

   ▇▇▅▅▅▅▃▂▁▁▂▁                                                ▂
  ██████████████▇▅▃▅▅▅▄▄▄▄▅▃▄▄▄▄▅▃▃▁▃▁▄▁▃▁▁▁▃▃▁▁▃▃▃▁▁▁▃▄▇█▇▆▆ █
  217 μs        Histogram: log(frequency) by time        363 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  180.126 μs …   5.800 ms  ┊ GC (min … max):  0.00% … 96.28%
 Time  (median):     184.454 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   215.405 μs ± 369.693 μs  ┊ GC (mean ± σ):  12.45% ±  6.97%

  ▆█▄▄▅▅▃▂▁   ▁                                                ▂
  █████████████▇▆▄▆▄▄▄▅▁▅▄▄▄▃▄▄▃▁▄▁▃▃▃▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄██▇ █
  180 μs        Histogram: log(frequency) by time        315 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
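
The nonzero allocation counts above are the telltale sign that mul! hits a generic, allocating fallback for StructArray inputs on this Julia version. A quick way to check for such a fallback without running a full benchmark is to count allocations directly:

# After a warm-up call, a nonzero allocation count for an in-place mul!
# indicates a generic fallback path rather than a specialized kernel.
mul!(du_soa, D_sparse, u_soa)
@allocated mul!(du_soa, D_sparse, u_soa)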

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.86 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`