Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are executed on cloud virtual machines when the documentation is generated automatically, so the timings can vary between runs.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
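
As a quick sanity check that is not part of the benchmarks, the operator should differentiate a smooth 1-periodic function with a small discretization error; a minimal sketch with ad-hoc names:

# Sanity check, not benchmarked: compare D_SBP applied to sin(2πx) with the
# exact derivative 2π cos(2πx). The error is small but nonzero since the
# operator is only second-order accurate.
u_smooth = sinpi.(2 .* x)
du_exact = 2π .* cospi.(2 .* x)
@show maximum(abs, D_SBP * u_smooth - du_exact)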

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  28.837 ns … 55.189 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.401 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.703 ns ±  1.555 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▄▃▂▁                                           ▁▁▁   ▂
  █████████▇▆▆▅▅▄▆▆▄▄▁▃▃▁▁▃▁▁▃▁▁▁▁▄▁▃▁▁▁▁▃▁▁▁▃▁▃▁▃▅▄▇█████ █
  28.8 ns      Histogram: log(frequency) by time      37.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.
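
As a hedged aside (not part of the original benchmarks), the sparse matrix itself is tiny, so the gap presumably stems from indexing overhead rather than memory footprint:

# Illustration only: dimensions and number of stored entries of D_sparse.
@show size(D_sparse)
@show nnz(D_sparse)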

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 595 evaluations per sample.
 Range (min … max):  200.224 ns … 540.793 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     204.146 ns                GC (median):    0.00%
 Time  (mean ± σ):   207.208 ns ±  18.008 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▆█▇                                                  
  ▂▂▄█████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▃▃▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  200 ns           Histogram: frequency by time          235 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.81 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
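
As a hedged aside (not part of the benchmarks), one can verify the defining summation-by-parts property M * D + Dᵀ * M = diag(-1, 0, …, 0, 1) directly, using mass_matrix from SummationByPartsOperators.jl; a minimal sketch:

# Illustration only: check the SBP property up to floating-point roundoff.
M = mass_matrix(D_SBP)
B = M * Matrix(D_SBP) + Matrix(D_SBP)' * M
B_ref = zero(B); B_ref[1, 1] = -1; B_ref[end, end] = 1
@show maximum(abs, B - B_ref) < 100 * eps()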

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  383.325 ns … 892.310 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     389.690 ns                GC (median):    0.00%
 Time  (mean ± σ):   393.236 ns ±  13.731 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▆█                                                   
  ▂▂▃▅████▇▅▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▂▁▁▂▁▂▁▂▁▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          438 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.806 μs … 10.085 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.846 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.897 μs ± 274.470 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃█  ▁                                              ▁▁▁   ▂
  ███▄▇█▇▃▁▁▁▃▅▆▆▅▄▃▃▁▁▁▁▁▁▁▁▁▃▃▁▁▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▇████▇ █
  4.81 μs      Histogram: log(frequency) by time      5.99 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.753 μs … 16.182 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.797 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.878 μs ± 420.866 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▁                                                      ▁
  ██▅██▁▃▁▁▅▆▅▅▄▃▃▁▄▃▃▁▃▁▄▁▃▁▄▁▁▁▁▃▃▃▁▃▄▁▆▇██▇▇▆▆▆▅▅▄▆▆▆▆▅▆ █
  6.75 μs      Histogram: log(frequency) by time      8.83 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
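
As a hedged aside (not part of the original benchmarks), the banded representation stores only a small number of sub- and superdiagonals, so the slowdown is not a storage issue; bandwidths is provided by BandedMatrices.jl:

# Illustration only: number of stored sub- and superdiagonals of D_banded.
@show bandwidths(D_banded)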

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.81 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
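
As a hedged aside (not part of the benchmarks), dissipation operators of this kind are designed such that M * Di is symmetric and negative semidefinite, so they can only remove energy; a minimal sketch reusing Di_full:

# Illustration only: check symmetry and negative semidefiniteness of M * Di
# up to floating-point roundoff.
M = mass_matrix(D_SBP)
MDi = M * Di_full
@show isapprox(MDi, MDi')
@show maximum(eigvals(Symmetric(MDi))) <= sqrt(eps())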

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  381.404 ns … 596.931 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.729 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.139 ns ±  12.073 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▇█                                                      
  ▂▂▃▅████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▂▂▁▁▂▂▂▁▁▂▁▂▂▂▂▂▂▃▃▃▂▂▂▂▂▂▂▂▂ ▃
  381 ns           Histogram: frequency by time          431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.023 μs … 2.511 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.032 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.042 μs ± 82.269 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▂▁▂▂▁▁▁▂▁▁▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  1.02 μs        Histogram: frequency by time        1.74 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower (by around an order of magnitude).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.542 μs … 13.183 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.580 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.643 μs ± 350.225 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█    ▁                                              ▁▁    ▁
  ██▅██▆▅▃▁▄▇▅▆▆▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▃▅████▇▆ █
  5.54 μs      Histogram: log(frequency) by time      6.89 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.212 μs … 13.419 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.240 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.370 μs ± 414.100 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                 ▅                                        ▁
  █▁▄▇▁▁▁▃▁▅▄▅▄▄▄▁▅█▃▄▁▅▅▁▃▄▃▁▄▆▅▃▃▁▃▁▃▃▁▅▇██▇▅▅▆▅▅▄▅▄▅▅▅▅▄ █
  6.21 μs      Histogram: log(frequency) by time      8.25 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷
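
To put a rough number on "bad idea", a hedged aside using Base.summarysize (not part of the original benchmarks):

# Illustration only: approximate memory footprint of the two representations;
# the dense matrix stores all N^2 entries (N = 10^3 here).
@show Base.summarysize(Di_full)
@show Base.summarysize(Di_sparse)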

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  149.019 μs … 304.991 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     151.573 μs                GC (median):    0.00%
 Time  (mean ± σ):   153.815 μs ±   5.700 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂▅▇█▆▅▃▂▂▁          ▂▄▅▄▃▂▁▁▁▁▁▁                          ▂
  ▂▆███████████▇▇▇▇▇▅▆▆███████████████▇▇▆▆▇▅▅▄▆▆▇▇▆▅▆▅▆▆▅▅▆▆▆ █
  149 μs        Histogram: log(frequency) by time        174 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.81 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, e.g., static vectors from StaticArrays.jl (AoS) and StructArrays.jl containers (SoA).

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176
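
To make the two layouts concrete before benchmarking, here is a small hedged illustration using the Vec5 type defined above (v_aos and v_soa are ad-hoc names, not part of the benchmarks):

# Illustration only: the same three points in both layouts. AoS keeps all five
# variables of one point together; SoA keeps each variable in its own array.
v_aos = [Vec5{Float64}(i, 2i, 3i, 4i, 5i) for i in 1:3]  # Vector of Vec5 (AoS)
v_soa = StructArray(v_aos)                               # five Vectors (SoA)
@show v_soa.x1  # first component of every point, stored contiguously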

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and of their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still by another large factor.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.011 ns … 73.574 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.639 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.133 ns ±  2.053 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▅█▇                                                     
  ▄████▇▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▁▂▂▂▁▂▁▂▁▁▂▁▁▂▂▂▂▂▂▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  46 ns           Histogram: frequency by time          56 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations per sample.
 Range (min … max):  334.471 ns … 693.380 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     341.045 ns                GC (median):    0.00%
 Time  (mean ± σ):   344.814 ns ±  14.335 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▂▄▇███▇▆▅▄▃▃▂▁▁▁                      ▁▂▃▃▂▂▁▁            ▃
  ▆████████████████████▅▇▆▆▆▅▅▅▅▅▅▄▃▁▄▃▄▅▇██████████▇█▆▅▆▅▃▆▄ █
  334 ns        Histogram: log(frequency) by time        387 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.711 μs … 4.068 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.727 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.746 μs ± 123.362 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                         ▁
  █▃▅▇█▄▁▁▁▄▁▃▄▅▅▅▁▄▁▁▃▃▁▁▁▁▁▁▁▁▁▁▇▆▁▃▁▁▁▄▃▃▁▁▁▃▃▁▁▃▁▁▄▇██ █
  1.71 μs      Histogram: log(frequency) by time      2.49 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.302 μs … 3.571 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.310 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.326 μs ± 105.446 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▁▁█▆▁▃▄▁▃▃▁▃▅▄▄▄▃▄▁▁▁▃▄▃▄▃▅▃▁▅▃▄▄▅▅▅▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▁▅▆▇ █
  1.3 μs       Histogram: log(frequency) by time      2.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.595 μs … 6.139 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.735 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.761 μs ± 163.756 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▁▆                                                  
  ▂▂▂▂▃▅██▄▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▂▂▁▁▁▁▂▁▂▁▂▂▁▁▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂ ▃
  2.59 μs         Histogram: frequency by time        3.58 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.927 μs … 21.120 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.010 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.112 μs ± 556.238 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█   ▁                                                ▁▁  ▂
  ██▄██▇▄▃▁▃▁▃▃▅▆▄▅▄▁▁▁▁▁▁▁▁▁▁▁▁▃▃▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▄█▁▁▆████ █
  8.93 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations per sample.
 Range (min … max):  205.039 ns … 293.257 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     207.955 ns                GC (median):    0.00%
 Time  (mean ± σ):   209.792 ns ±   5.175 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▄█                                                
  ▂▂▂▂▃▆███▆▅▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▁▁▂▂▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  205 ns           Histogram: frequency by time          226 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 162 evaluations per sample.
 Range (min … max):  634.086 ns … 926.858 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     672.426 ns                GC (median):    0.00%
 Time  (mean ± σ):   677.743 ns ±  19.151 ns   GC (mean ± σ):  0.00% ± 0.00%

                    ▂▅▇▇█▇▄▃                                  
  ▂▁▁▁▁▁▁▁▂▁▂▂▂▂▃▄▅█████████▇▆▅▄▃▃▃▂▂▂▂▂▂▁▁▂▂▂▃▃▃▄▄▄▄▃▃▃▃▃▂▂▂ ▄
  634 ns           Histogram: frequency by time          733 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.073 μs … 18.107 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.144 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.237 μs ± 577.122 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█  ▁                                               ▁▁   ▂
  ███▃▇█▆▁▁▁▁▁▁▃▆▅▅▅▄▃▁█▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▄▇▇████▆ █
  7.07 μs      Histogram: log(frequency) by time      9.16 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (min … max):  209.027 ns … 282.418 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.922 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.814 ns ±   5.159 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▇█▅▄▄▃▁                                 ▁▃▃▃▃▂▁          ▂
  ▃▄▇████████▆▄▁▆███▇▅▄▅▄▃▁▁▁▁▃▃▁▁▁▃▁▁▁▁▁▃▁▃▆█████████▆▇▆▇▅▇▆ █
  209 ns        Histogram: log(frequency) by time        229 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations per sample.
 Range (min … max):  583.961 ns … 946.304 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     595.033 ns                GC (median):    0.00%
 Time  (mean ± σ):   600.433 ns ±  18.105 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▁▅▇█                                                
  ▂▂▂▃▄▆█████▇▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▂▁▂▂▂▂▂▂▃▃▃▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  584 ns           Histogram: frequency by time          653 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.389 μs … 17.633 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.447 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.527 μs ± 435.778 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█                                                         ▁
  ██▃▆▇▄▁▁▁▃▁▁▁▄▄▅▅▄▁▃▁▃▁▃▃▁▁▁▁▃▁▁▁▁▃▁▁▁▁▃▁▁▁▄▃▁▃▁▅▇████▇▆ █
  7.39 μs      Histogram: log(frequency) by time      9.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case; the nonzero allocation counts reported below suggest a fallback to a generic, allocating code path.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 478 evaluations per sample.
 Range (min … max):  224.142 ns … 344.033 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     226.617 ns                GC (median):    0.00%
 Time  (mean ± σ):   228.698 ns ±   6.908 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▄█                                                    
  ▂▃▇███▇▅▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▂▁▁▁▂▂▂▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂ ▃
  224 ns           Histogram: frequency by time          252 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  228.176 μs … 5.045 ms   GC (min … max): 0.00% … 94.18%
 Time  (median):     234.608 μs                GC (median):    0.00%
 Time  (mean ± σ):   266.069 μs ± 344.898 μs   GC (mean ± σ):  9.89% ±  7.21%

  ▃▇█▄▅▆▄▃▂▁▁▁▁                                                ▂
  ███████████████▆▆▅▅▅▄▄▄▃▃▆▄▄▄▃▃▄▃▁▄▁▁▄▄▃▁▃▃▃▁▁▁▁▁▃▁▁▁▄▇▇█▄▇ █
  228 μs        Histogram: log(frequency) by time        374 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  171.431 μs … 4.865 ms   GC (min … max):  0.00% … 95.60%
 Time  (median):     175.979 μs                GC (median):     0.00%
 Time  (mean ± σ):   206.767 μs ± 342.034 μs   GC (mean ± σ):  12.64% ±  7.31%

  ▄█▄▃▅▄▃▂▁                                                    ▂
  █████████████▇▆▆▅▅▄▁▃▄▄▄▃▁▃▃▄▃▁▄▁▃▁▃▁▁▃▃▁▁▁▁▁▁▁▃▃▁▁▁▁▁▅▇█▇▅ █
  171 μs        Histogram: log(frequency) by time        313 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.13
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.81 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`