Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they are run on virtual machines in the cloud when this documentation is generated automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Benchmark the in-place matrix-vector product mul!(du, D, u) and print the results.
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # give the output time to flush
  # Interpolating the arguments with $ avoids benchmarking global-variable access.
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  29.573 ns … 78.589 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     30.540 ns               GC (median):    0.00%
 Time  (mean ± σ):   30.788 ns ±  1.564 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄▆█                                                  
  ▃▄▆████▇▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▂▂▁▁▁▁▁▁▁▁▁▁▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  29.6 ns         Histogram: frequency by time        37.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 232 evaluations per sample.
 Range (min … max):  311.401 ns … 687.233 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     336.103 ns                GC (median):    0.00%
 Time  (mean ± σ):   339.156 ns ±  12.020 ns   GC (mean ± σ):  0.00% ± 0.00%

                      ▁▄▇█                                   
  ▂▂▁▂▂▂▁▂▂▂▂▂▂▂▂▂▃▄▅▇█████▇▄▃▃▃▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂ ▃
  311 ns           Histogram: frequency by time          375 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
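
To quantify the gap on a given machine, one can compute the speedup directly. Here is a minimal sketch using @belapsed from BenchmarkTools, assuming the variables defined above:

# Minimum runtimes in seconds; the ratio is the speedup of the optimized
# implementation over the sparse matrix representation.
t_sbp    = @belapsed mul!($du, $D_SBP, $u)
t_sparse = @belapsed mul!($du, $D_sparse, $u)
@show t_sparse / t_sbp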

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.84 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 204 evaluations per sample.
 Range (min … max):  379.926 ns … 657.064 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     384.784 ns                GC (median):    0.00%
 Time  (mean ± σ):   388.166 ns ±  12.407 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▆█                                                       
  ▂▂▃████▅▃▂▂▂▁▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  380 ns           Histogram: frequency by time          431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.902 μs … 10.945 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.946 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.997 μs ± 285.020 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇                                                 ▁▁    ▂
  ███▇▃▅▃▃▁▁▁▁▁▄▅▆▆▅▅▃▁▃▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▆████▇▇ █
  4.9 μs       Histogram: log(frequency) by time      6.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.677 μs … 15.092 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.700 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.784 μs ± 439.685 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▄▃▁▃▁▁▁▄▇▆▅▅▄▄▄▃▃▄▃▃▃▄▁▄▄▄▃▃▁▃▃▁▁▁▃▃▃▇███▇▆▆▆▆▆▇▅▆▆▅▅▆ █
  6.68 μs      Histogram: log(frequency) by time      8.67 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
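
One contributing factor is that a BandedMatrix stores every entry within its bandwidths, including interior zeros, and the boundary closure of this operator makes the band wider than the interior stencil. A quick inspection, assuming the setup above:

# Inspect the stored structure of the two matrix representations.
@show bandwidths(D_banded)  # (lower, upper) bandwidths of the banded matrix
@show nnz(D_sparse)         # number of explicitly stored entries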

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.84 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before and start by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.537 ns … 595.300 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.719 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.977 ns ±  11.784 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂██                                                   
  ▂▂▃▅████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▂▁▁▂▂▁▁▂▁▁▁▂▁▂▂▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.011 μs … 2.824 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.020 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.030 μs ± 82.510 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▁                                                       ▁
  ██▅▁▆▇▁▃▁▁▃▄▃▃▃▄▄▄▄▃▁▅▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄ █
  1.01 μs      Histogram: log(frequency) by time     1.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.243 μs … 13.519 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.275 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.333 μs ± 326.007 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                   ▁▁   ▁
  ██▇▃▅▃▄▁▁▃▁▁▁▅▆▅▅▄▃▄▃▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▃▆███▇▇ █
  5.24 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.224 μs … 15.727 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.264 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.459 μs ± 503.034 ns   GC (mean ± σ):  0.00% ± 0.00%

  █               ▂▆▁                                        ▂
  █▄▄▅▅▅▃▄▁▅▆▅▅▆▆███▃▃▃▃▄▃▁▃▃▄▇▃▄▁▁▃▁▄▁▁███▇▆▇▇▇▆▅▅▆▅▆▅▅▅▃▆ █
  6.22 μs      Histogram: log(frequency) by time      8.37 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  133.639 μs … 314.598 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     135.042 μs                GC (median):    0.00%
 Time  (mean ± σ):   137.203 μs ±   6.030 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▆▃▂▁▁▁           ▄▄▄▂                                    ▂
  ██████████▇▆▆▄▅▆▄▃▆█████████████▇█▇▆▅▅▆▄▄▄▄▅▅▄▄▄▄▄▄▅▆▄▆▄▄▄▅ █
  134 μs        Histogram: log(frequency) by time        159 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
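
The dense representation is also memory-hungry: for N = 10^3, the matrix alone takes 1000 × 1000 × 8 bytes = 8 MB. A quick check with Base.summarysize, assuming the operators defined above:

# Approximate storage footprint of each representation in bytes.
@show Base.summarysize(Di_full)
@show Base.summarysize(Di_sparse)
@show Base.summarysize(Di_SBP)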

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.84 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structured element types such as StaticVectors from StaticArrays.jl and of StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operator implemented in SummationByPartsOperators.jl and its representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.092 ns … 74.679 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.690 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.163 ns ±  1.929 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇██▇▆▄▃▂▂▁                                    ▁▂▂▂▁▁     ▂
  ███████████▇▆▇▆▇▆▆▅▆▅▃▁▁▃▁▃▁▁▁▁▃▁▁▃▁▁▁▁▁▃▁▁▁▁▃▇███████▇▇▆ █
  46.1 ns      Histogram: log(frequency) by time      54.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 222 evaluations per sample.
 Range (min … max):  334.450 ns … 613.356 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     340.590 ns                GC (median):    0.00%
 Time  (mean ± σ):   344.322 ns ±  13.599 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▇█▆                                                     
  ▂▂▃▅█████▆▅▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  334 ns           Histogram: frequency by time          384 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.718 μs … 4.502 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.732 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.756 μs ± 137.742 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                         ▁
  █▅▅██▃▄▃▁▁▄▁▃▄▅▅▁▄▄▃▅▁▁▃▃▄▃▄▃▄▁▄▁▃▃▄▃▁▄▁▁▄▃▃▁▁▃█▄▄▁▃▅▇█▇ █
  1.72 μs      Histogram: log(frequency) by time      2.49 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array together with our custom mul_aos! implementation, which loops over each component and applies mul! to views. Here, the differences between the timings are less pronounced, presumably because every implementation now operates on non-contiguous (strided) views; see the check after the following benchmarks.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.296 μs … 3.824 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.308 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.321 μs ± 100.167 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▁▁▁▁▄▁▃▁▁▁▁▃▁▁▃▃▄▅▄▃▄▃▁▁▄▁▁▃▁▃▃▃▁▃▄▅▄▁▃▁▁▁▁▃▁▁▃▁▁▁▁▁▁▃▆▇ █
  1.3 μs       Histogram: log(frequency) by time      2.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.582 μs … 5.993 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.815 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.845 μs ± 174.083 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▂▇█▁                                            
  ▂▁▂▂▂▂▂▃▄▆████▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  2.58 μs         Histogram: frequency by time        3.66 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.753 μs … 18.448 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.800 μs                GC (median):    0.00%
 Time  (mean ± σ):   8.897 μs ± 536.245 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                     ▁▁  ▂
  █▄▃█▇▁▁▁▁▁▁▁▁▄▆▅▄▄▁▁▁▃▁▁▄▃▁▁▁▄▃▁▃▁▁▁▃▁▁▁▁▃▃▁▁▁▁▃▁▁▄▆████ █
  8.75 μs      Histogram: log(frequency) by time      11.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
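
As mentioned above, each component view taken by mul_aos! is non-contiguous, which presumably costs the optimized implementation part of its edge. A quick check, assuming the arrays defined above:

# Each row view of the 5×101 array u_aos_plain has stride 5,
# i.e., the data of one component is not contiguous in memory.
@show strides(view(u_aos_plain, 1, :))  # (5,)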

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
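
Here, reinterpret(reshape, ...) views the 5×N matrix of scalars as a length-N vector of Vec5 without copying any data. A small self-contained illustration, assuming only the Vec5 type defined above:

# View a 5×3 matrix of scalars as a length-3 vector of Vec5; no copy is made.
A = randn(5, 3)
v = reinterpret(reshape, Vec5{Float64}, A)
@show size(v)                            # (3,)
@show v[2] == Vec5{Float64}(A[:, 2]...)  # true
A[1, 2] = 0.0                            # mutating A is visible through v
@show v[2].x1                            # 0.0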

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (min … max):  207.896 ns … 303.258 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.485 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.259 ns ±   5.208 ns   GC (mean ± σ):  0.00% ± 0.00%

        ██                                                     
  ▂▂▂▂▄████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  208 ns           Histogram: frequency by time          229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 29 evaluations per sample.
 Range (min … max):  907.552 ns … 2.070 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     965.276 ns               GC (median):    0.00%
 Time  (mean ± σ):   974.051 ns ± 51.994 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▅▇█▃▂▁                                        ▁▁▁ ▂
  ▅▅▅▇▇███████████▄▅▄▄▆▄▅▅▅▃▄▃▄▁▃▄▄▁▃▁▄▄▃▁▁▄▃▄▅▃▅▃▁▁▃▁▇████ █
  908 ns        Histogram: log(frequency) by time      1.22 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.356 μs … 21.139 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.419 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.514 μs ± 601.684 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                               ▁▁        ▁
  ██▁▄▄▄▁▃▁▁▁▃▅▆▅▅▃▄▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▅████▇▅▅▄▅▄ █
  7.36 μs      Histogram: log(frequency) by time      9.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 555 evaluations per sample.
 Range (min … max):  206.944 ns … 279.259 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.670 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.392 ns ±   4.982 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▅█                                                     
  ▂▂▂▂▃▇██▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  207 ns           Histogram: frequency by time          228 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 176 evaluations per sample.
 Range (min … max):  601.807 ns … 1.122 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     621.330 ns               GC (median):    0.00%
 Time  (mean ± σ):   636.268 ns ± 56.908 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█  ▁▄▄▁                                                ▂
  ▇███▇▆█████▇▇▆▆▆▇▇▆▆▅▅▅▃▅▃▄▄▄▁▃▄▁▃▁▃▃▁▁▄▅▄▅▄▅▅▆▆▆▆▆▇▇▇███ █
  602 ns        Histogram: log(frequency) by time       928 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.376 μs … 17.813 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.429 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.504 μs ± 439.049 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                    ▁▁   ▁
  ██▁▅▄▃▁▃▁▁▁▁▃▄▆▄▆▁▄▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▃▆████▇ █
  7.38 μs      Histogram: log(frequency) by time      9.34 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case, allocating in every call.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 460 evaluations per sample.
 Range (min … max):  235.026 ns … 355.533 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     237.465 ns                GC (median):    0.00%
 Time  (mean ± σ):   239.446 ns ±   6.391 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▅█                                                   
  ▂▂▄████▇▅▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▂▁▁▁▂▁▁▂▁▂▂▁▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  235 ns           Histogram: frequency by time          259 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  211.957 μs … 6.624 ms   GC (min … max):  0.00% … 95.95%
 Time  (median):     220.853 μs                GC (median):     0.00%
 Time  (mean ± σ):   256.440 μs ± 425.821 μs   GC (mean ± σ):  12.04% ±  6.96%

   ▆█▅▅▆▄▂▁▁▁                                                  ▂
  ████████████▇▇▄▅▄▅▄▄▃▃▄▄▄▅▄▁▃▃▃▄▁▄▃▁▁▅▃▁▄▁▁▁▃▅▇█▆▆▇▆▄▃▃▁▁▁▃ █
  212 μs        Histogram: log(frequency) by time        387 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  178.764 μs … 6.403 ms   GC (min … max):  0.00% … 96.61%
 Time  (median):     184.780 μs                GC (median):     0.00%
 Time  (mean ± σ):   220.373 μs ± 435.571 μs   GC (mean ± σ):  14.33% ±  7.00%

  ▂▇█▅▄▅▄▃▂▁                                                   ▂
  ████████████▇▇▇▅▅▃▄▄▄▄▃▃▅▅▄▄▁▄▃▃▄▁▁▁▃▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▆▇▇ █
  179 μs        Histogram: log(frequency) by time        318 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
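
The memory estimates above point to the cause: neither SparseArrays nor LinearAlgebra provides specialized mul! methods for a StructArray destination or source, so these calls fall back to generic routines that allocate during every product. A way to check this, assuming the setup above, is to look up the methods that are actually selected:

using InteractiveUtils  # for @which

# The StructArray arguments dispatch to generic fallbacks instead of the
# specialized kernels selected for the plain arrays.
display(@which mul!(du_soa, D_sparse, u_soa))
display(@which mul!(du, D_sparse, u))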

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.84 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`