Benchmarks

Here are some simple benchmarks. Take them with a grain of salt since they are run on virtual machines in the cloud as part of the automated documentation build.
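
If you want numbers that are representative of your own hardware, you can rerun the snippets below locally. A minimal sketch of one way to set this up, using a throwaway environment:

using Pkg
Pkg.activate(temp=true)  # temporary environment; nothing is added to your default project
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "BandedMatrices",
         "StaticArrays", "StructArrays"])
# now paste the code snippets from this page into the REPL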

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

# Print a label, then run and pretty-print a benchmark of the in-place product du .= D * u
function doit(D, text, du, u)
  println(text)
  sleep(0.1) # short pause so the label is flushed before the benchmark output
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  28.405 ns … 56.013 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.079 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.740 ns ±  2.507 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅██▇▆▅▂▁                          ▁▂▁▁▁                   ▂
  ████████▇▇▇▆▄▅▅▄▃▁▃▃▃▃▁▁▁▁▁▁▄▁▃▃▆▇█████████▇▇▇▇▆▇▇▆▇▇▇███ █
  28.4 ns      Histogram: log(frequency) by time      40.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other details, this can be roughly an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 595 evaluations per sample.
 Range (min … max):  199.664 ns … 530.279 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     202.576 ns                GC (median):    0.00%
 Time  (mean ± σ):   206.141 ns ±  20.461 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▆█▄                                                        
  ▁▄███▆▅▄▃▂▂▂▁▁▁▁▁▁▁▁▂▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  200 ns           Histogram: frequency by time          237 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
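
The comparison above is made by eye; if you prefer to quantify it, BenchmarkTools.jl can compare two trials directly. A small sketch along these lines, reusing du, u, and the operators from above:

trial_sbp    = @benchmark mul!($du, $D_SBP, $u)
trial_sparse = @benchmark mul!($du, $D_sparse, $u)
ratio(median(trial_sparse), median(trial_sbp))  # how much longer the sparse version takes
judge(median(trial_sparse), median(trial_sbp))  # classifies the difference (here: a regression)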

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.78 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  388.458 ns … 949.892 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     392.448 ns                GC (median):    0.00%
 Time  (mean ± σ):   397.167 ns ±  17.235 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▇█▅▃                   ▂▃▃▂▁                               ▂
  █████▆▆▇█▇▅▁▃▁▃▁▁▃▄▃▁▃▃▆██████▇▇▆▅▄▁▁▃▁▁▆▆▆▄▆▆▆▆▆▆▅▆▅▄▅▅▄▅▆ █
  388 ns        Histogram: log(frequency) by time        473 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as for periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.692 μs … 12.102 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.730 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.787 μs ± 328.958 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁▁                                           ▁▁       ▁
  ███▄██▄▁▁▁▁▅▆▆▄▄▄▁▁▁▁▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▄▇███▇▆▆▆▅▅ █
  4.69 μs      Histogram: log(frequency) by time      5.93 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.694 μs … 42.625 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.725 μs               GC (median):    0.00%
 Time  (mean ± σ):   7.004 μs ±  1.400 μs   GC (mean ± σ):  0.00% ± 0.00%

            ▁▁                                            ▁
  █▇▇▅▃▄▅▄▄▅▅██▇▇▇▇▆▄▄▄▆▃▄▅▄▁▄▄▄▄▄▃▄▃▃▅▅▅▆▇▆▅▅▅▅▅▅▄▄▆▅▆▇▇▇ █
  6.69 μs      Histogram: log(frequency) by time       13 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
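
A plausible contributing factor (an educated guess, not a statement about the internals of BandedMatrices.jl): the BandedMatrix has to use bandwidths large enough to cover the wide boundary closures, so the generic banded kernel touches many structural zeros that the specialized implementation skips. The bandwidths can be inspected as follows:

@show bandwidths(D_banded)  # (lower, upper) bandwidths; wider than the interior stencil alone would require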

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.78 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (min … max):  388.094 ns … 699.366 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     392.213 ns                GC (median):    0.00%
 Time  (mean ± σ):   396.380 ns ±  17.179 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▆█                                                        
  ▂▃▇██▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▁▁▁▂▂▁▁▂▁▁▁▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  388 ns           Histogram: frequency by time          441 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.013 μs … 2.709 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.021 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.032 μs ± 83.881 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▁▁▂▂▂▂▁▁▁▂▁▂▂▂▂▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.01 μs        Histogram: frequency by time        1.73 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again significantly slower, here by a factor of five to six.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.570 μs … 11.580 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.667 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.729 μs ± 323.804 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▁▁▁                                          ▁▁▁▁    ▂
  ████████▆▄▃▃▄▆▆▅▄▁▃▄▃▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▃▁▃▅▆▇████▇▆▆ █
  5.57 μs      Histogram: log(frequency) by time      7.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.226 μs … 13.539 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.266 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.472 μs ± 471.714 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁        ▆▅                                        ▂
  █▆▄▇█▆██▅▄▅▆▆▅▄▄██▄▃▅█▁▄▁▃▁▄▅▅▄▄▃▁▃▁▄▆▇██▇▆▆▆▆▇▆▆▆▅▅▅▆▆▆▇ █
  6.23 μs      Histogram: log(frequency) by time      8.33 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  150.039 μs … 327.548 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     155.654 μs                GC (median):    0.00%
 Time  (mean ± σ):   156.847 μs ±   6.127 μs   GC (mean ± σ):  0.00% ± 0.00%

     ▄█▂                                                        
  ▁▁▄███▆▄▃▃▃▂▃▃▄▅▅▅▄▃▂▃▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  150 μs           Histogram: frequency by time          177 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
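
Besides the runtime, the dense representation is also wasteful in terms of memory: it stores all N^2 entries although the operator has only O(N) nonzeros, i.e., 10^6 Float64 values (8 MB) for N = 10^3. A quick way to compare the footprints:

Base.summarysize(Di_full)    # ≈ 8 MB (10^6 Float64 entries)
Base.summarysize(Di_sparse)  # orders of magnitude smaller for this operator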

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.78 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, in particular of statically sized vectors from StaticArrays.jl and of struct-of-array wrappers from StructArrays.jl.
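
To make the terminology concrete, here is a tiny sketch of the difference between the two layouts; Vec2 is a made-up two-component analogue of the Vec5 type defined below.

using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2,T}
  x1::T
  x2::T
end

v_aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]  # AoS: memory holds 1.0, 2.0, 3.0, 4.0
v_soa = StructArray(v_aos)                # SoA: one contiguous array per field
v_soa.x1                                  # [1.0, 3.0]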

To demonstrate this support, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and of their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still by a wide margin.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  44.207 ns … 319.724 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.010 ns                GC (median):    0.00%
 Time  (mean ± σ):   48.886 ns ±  13.814 ns   GC (mean ± σ):  0.00% ± 0.00%

   █   ▁▃           ▁                                       ▁
  ▇██▇▆▅███▆▆▆█▇█▇▇▇▇█▅▃▄▅▄▄▄▄▄▅▁▄▁▃▄▁▄▄▅▄▄▅▆▆▆▅▆▆▆▆▆▆▆▆▇▇▆▆ █
  44.2 ns       Histogram: log(frequency) by time       102 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations per sample.
 Range (min … max):  334.516 ns … 638.833 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     340.136 ns                GC (median):    0.00%
 Time  (mean ± σ):   343.792 ns ±  13.825 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄██▃▁                                                    
  ▂▂▃▅█████▇▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  335 ns           Histogram: frequency by time          381 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.709 μs … 4.318 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.720 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.785 μs ± 229.894 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                         ▁                ▁
  █▆▄▇▁▁▁▁▁▆▄▄▃▃▁▃▃▃▁▁▁▃▁▁▁▃▆█▇▆▆▇▇▇▅▆█▆▅▆███▇▆▆▆▆▆▇▇▇▆▇▇▇█ █
  1.71 μs      Histogram: log(frequency) by time      2.72 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.310 μs … 2.922 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.321 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.337 μs ± 105.783 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▄▄▅█▄▄▄▄▃▃▄▅▅▄▅▄▅▄▅▄▄▅▃▅▅▄▄▄▅▅▁▅▁▁▄▁▄▄▄▃▄▃▃▃▄▅▃▁▁▁▁▁▃▆▇ █
  1.31 μs      Histogram: log(frequency) by time      2.04 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.353 μs … 4.921 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.403 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.427 μs ± 143.606 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█▃                                                  ▁  ▂
  █████▆▃▁▁▁▁▃▃▅▅▅▅▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▆█▄▄▁▁▁▁▃▁▁▃▁▁▃▆▆███ █
  2.35 μs      Histogram: log(frequency) by time      3.21 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.970 μs … 17.562 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.020 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.110 μs ± 499.406 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                     ▁▁  ▁
  █▅▄▆▇▃▁▁▃▁▁▅▅▅▅▃▄▁▃▃▃▁▁▁▁▃▁▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▇▆▇███ █
  8.97 μs      Histogram: log(frequency) by time      11.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 555 evaluations per sample.
 Range (min … max):  207.845 ns … 366.932 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     211.474 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.503 ns ±   6.789 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▄▅▅▆█▇▅▃                 ▁▁▂▄▄▃▂▁                         ▂
  ▄█████████████▆▆▅▆▄▁▄▅▄▄▄▆██████████▇▇▆▅▆▅▅▅▄▄▅▆▆▆▆▇▇▇▇▆▆▆▅ █
  208 ns        Histogram: log(frequency) by time        238 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 180 evaluations per sample.
 Range (min … max):  585.361 ns … 1.062 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     594.544 ns               GC (median):    0.00%
 Time  (mean ± σ):   600.845 ns ± 24.628 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▃▅███▅▃▂▂▂▂▂▂▂▂▂▁▂▁▂▂▁▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂ ▃
  585 ns          Histogram: frequency by time          677 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.173 μs … 21.848 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.223 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.354 μs ± 825.307 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                         ▁                               ▁
  █▆▄▁▅▇▄▄▃▄▁▄▁▁▁▁▁▁▁▃▄▃▄▅███▆▆▅▄▄▄▄▄▃▃▁▁▄▁▄▁▁▃▁▃▃▃▃▁▃▁▃▄▄ █
  7.17 μs      Histogram: log(frequency) by time      11.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (min … max):  205.673 ns … 991.308 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.742 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.187 ns ±  13.566 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▇█▅▃▁            ▃▄▄▂▁                                   ▂
  ▃▆███████▇▅▁▁▄▁▃▆▁▃▇██████▆▆▄▁▅▁▁▅▅▆▆▇▅▅▆▁▅▄▄▃▁▃▁▁▃▅▆▃▅▄▃▄▅ █
  206 ns        Histogram: log(frequency) by time        246 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations per sample.
 Range (min … max):  582.735 ns … 2.579 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     592.260 ns               GC (median):    0.00%
 Time  (mean ± σ):   600.023 ns ± 44.333 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▃▆███▅▄▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▂▂▁▂▁▂▂▂▂ ▃
  583 ns          Histogram: frequency by time          673 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.484 μs … 29.086 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.549 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.652 μs ± 687.726 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█  ▁                                           ▁▁▁      ▂
  ███▅██▆▃▁▁▁▃▆▆▆▇▆▅▅▃▁▁▃▁▁▁▁▁▃▃▄▁▄▁▁▁▃▃▁▁▃▃▁▁▁▃▁▄▇███▇▇▆▆▅ █
  7.48 μs      Histogram: log(frequency) by time      9.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (min … max):  221.686 ns … 378.295 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     224.013 ns                GC (median):    0.00%
 Time  (mean ± σ):   226.628 ns ±   9.738 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▂▁        ▄▄▂                                            ▂
  █████▆▄▄▄▁▃▃█████▇▇▇▅▅▄▁▅▅▁▃▃▃▄▅▃▄▃▄▄▁▅▁▄▁▁▁▄▃▃▁▁▁▃▃▁▄▁▁▁▃▄ █
  222 ns        Histogram: log(frequency) by time        287 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  211.873 μs … 6.382 ms   GC (min … max): 0.00% … 62.02%
 Time  (median):     221.210 μs                GC (median):    0.00%
 Time  (mean ± σ):   247.027 μs ± 299.088 μs   GC (mean ± σ):  8.74% ±  6.87%

   ▄▆█▄▅▆▃▁                                                    ▂
  ▇████████▇▇▇▇▅▅▅▆▅▄▅▄▅▄▅▅▄▄▄▃▁▁▄▁▁▁▁▁▃▃▁▃▁▁▁▁▁▃▃▁▁▁▁▁▄▆▇▇▇▆ █
  212 μs        Histogram: log(frequency) by time        363 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  177.028 μs … 4.542 ms   GC (min … max):  0.00% … 93.62%
 Time  (median):     186.236 μs                GC (median):     0.00%
 Time  (mean ± σ):   210.355 μs ± 291.538 μs   GC (mean ± σ):  10.03% ±  6.90%

     █                                                         
  ▃▆▇█▃▃▆▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▁▁▂▂▁▂▁▁▂▁▁▁▁▂▁▁▁▁▁▂▂▂▂▂ ▂
  177 μs           Histogram: frequency by time          323 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
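
The nonzero memory estimates above suggest that, on Julia v1.6, mul! with a sparse or dense matrix and a StructArray hits a generic fallback that allocates temporaries, whereas the operator from SummationByPartsOperators.jl handles the SoA layout without allocating. A quick sanity check (run each line twice to exclude compilation overhead):

@allocated mul!(du_soa, D_SBP, u_soa)     # 0 bytes
@allocated mul!(du_soa, D_sparse, u_soa)  # nonzero, matching the estimate above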

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.13
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.78 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`