Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they run on virtual machines in the cloud that are used to generate the documentation automatically.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  30.107 ns … 61.141 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     31.245 ns               GC (median):    0.00%
 Time  (mean ± σ):   31.572 ns ±  1.660 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▅▇█▇▆▄▂▁                                        ▁▁▁▁  ▂
  ▄▄▆██████████▇▅▃▅▇▅▆▃▄▄▁▄▄▁▁▃▁▄▁▃▁▃▃▁▃▁▃▁▁▄▃▁▁▁▁▅▆▇██████ █
  30.1 ns      Histogram: log(frequency) by time      39.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.
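
To get an intuition for the gap, here is a minimal sketch of the kind of stencil loop such a periodic operator can use internally (a hypothetical helper for illustration only, not the actual implementation in SummationByPartsOperators.jl): a second-order periodic central difference needs only a single coefficient and no explicit matrix, so the compiler can vectorize the inner loop easily.

# Illustrative sketch: hand-rolled second-order periodic central difference
function central_difference_periodic!(du, u, Δx)
  N = length(u)
  c = 1 / (2 * Δx)
  @inbounds for i in 2:(N - 1)
    du[i] = c * (u[i + 1] - u[i - 1])
  end
  @inbounds du[1] = c * (u[2] - u[N])     # wrap around at the left boundary
  @inbounds du[N] = c * (u[1] - u[N - 1]) # wrap around at the right boundary
  return du
end

A sparse matvec, in contrast, has to load stored indices and coefficients from memory for every entry, which accounts for much of the difference measured below.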

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 606 evaluations per sample.
 Range (min … max):  198.739 ns … 428.589 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     200.789 ns                GC (median):    0.00%
 Time  (mean ± σ):   202.937 ns ±   6.540 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▆█▅▁                                                       
  ▁▃█████▆▆▄▄▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  199 ns           Histogram: frequency by time          218 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.83 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 204 evaluations per sample.
 Range (min … max):  379.333 ns … 656.618 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     383.706 ns                GC (median):    0.00%
 Time  (mean ± σ):   387.325 ns ±  13.323 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▅█                                                    
  ▂▃▇███▇▄▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  379 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.534 μs … 12.121 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.568 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.625 μs ± 352.645 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█                                                   ▁▁    ▁
  ██▆▁▄▅▃▃▁▁▁▄▆▅▆▆▇▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆▇████▇ █
  4.53 μs      Histogram: log(frequency) by time       5.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.
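
One way to see how much data the banded format has to process is to query its band sizes (bandwidths is part of the BandedMatrices.jl API). The wide boundary closures of the SBP operator typically enlarge the stored bands compared to the interior stencil; the following is a rough sketch of the bookkeeping, not a full cost model.

lbw, ubw = bandwidths(D_banded)      # number of sub- and superdiagonals stored
(lbw + ubw + 1) * size(D_banded, 1)  # rough number of stored coefficients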

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.701 μs … 16.755 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.734 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.805 μs ± 413.080 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                               ▁▁         ▂
  █▇▅▅▁▁▃▁▁▁▆▆▄▅▄▃▁▃▃▃▁▃▃▃▁▁▁▃▁▃▁▁▁▁▁▁▁▃▁▁▁▁▄▃▇▇██▇▇▇▆▆▆▆▆ █
  6.7 μs       Histogram: log(frequency) by time      8.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.83 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  381.650 ns … 780.768 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.384 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.011 ns ±  14.452 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▇█                                                    
  ▂▃▅████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁▁▁▁▂▂▁▂▁▁▁▂▂▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  382 ns           Histogram: frequency by time          431 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 17 evaluations per sample.
 Range (min … max):  973.588 ns … 2.689 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     980.059 ns               GC (median):    0.00%
 Time  (mean ± σ):   991.480 ns ± 72.817 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▁                                                      ▁
  █▄▄▃█▇▄▁▄▁▁▃▁▃▄▆▁▄▃▄▃▁▃▃▁▄▁▄▁▁▃▃▄▄▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▆▇█▇ █
  974 ns        Histogram: log(frequency) by time      1.44 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.186 μs … 12.313 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.342 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.407 μs ± 393.701 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▆▇█▁                                          ▁▁▁     ▂
  ▅▇█████▄▁▁▃▁▄▆▇█▆▅▃▃▃▄▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▁▆▇█████▇▆▆ █
  5.19 μs      Histogram: log(frequency) by time      6.76 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.213 μs … 13.485 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.242 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.412 μs ± 433.146 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                 ▅                                         ▁
  █▄▅▄▄▃▁▁▅▅▅▅▄█▆██▅▁▁▃▄▁▁▁▃▄▄▄▁▁▁▁▁▃▁▁▄▆▇█▇▇▆▅▅▆▅▅▆▅▆▅▅▆▅▅ █
  6.21 μs      Histogram: log(frequency) by time      8.34 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷
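
A back-of-the-envelope operation count shows why (a sketch assuming the matvec cost is dominated by multiply-add operations and ignoring memory effects):

N = size(Di_full, 1)
dense_ops  = N^2             # a dense matvec touches every matrix entry
sparse_ops = nnz(Di_sparse)  # a sparse matvec touches only the stored entries
dense_ops / sparse_ops       # crude estimate of the expected slowdown factor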

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  148.067 μs … 387.324 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     150.271 μs                GC (median):    0.00%
 Time  (mean ± σ):   152.437 μs ±   6.523 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂█▆                                                    
  ▁▃████▅▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▃▃▃▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  148 μs           Histogram: frequency by time          174 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.83 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structured data layouts such as StaticVectors from StaticArrays.jl and StructArrays.jl.
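
The difference between the two layouts can be summarized in a tiny sketch (the component names rho and v are made up for this illustration):

# Array of Structures (AoS): all components of one grid point stored together
aos = [(rho = 1.0, v = 0.1), (rho = 2.0, v = 0.2)]
# Structure of Arrays (SoA): one contiguous array per component
soa = (rho = [1.0, 2.0], v = [0.1, 0.2])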

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  47.781 ns … 107.216 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     49.495 ns                GC (median):    0.00%
 Time  (mean ± σ):   49.880 ns ±   2.150 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▂▃▃▅▆█▄▂                                              
  ▂▄▇██████████▆▄▃▃▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂ ▄
  47.8 ns         Histogram: frequency by time           58 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (min … max):  324.601 ns … 684.127 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     330.750 ns                GC (median):    0.00%
 Time  (mean ± σ):   334.308 ns ±  15.848 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃▇██▄▁                                                   
  ▂▂▄▆███████▆▄▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂ ▃
  325 ns           Histogram: frequency by time          371 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.715 μs … 5.151 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.724 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.745 μs ± 132.994 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▅▃█▅▃▁▁▁▃▃▃▄▄▃▃▃▃▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▄▁▁▁▁▃▁▁▁▄▁▁▁▃▁▁▄▃▅▆██▇ █
  1.72 μs      Histogram: log(frequency) by time      2.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.298 μs … 3.438 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.312 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.326 μs ± 97.456 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                        ▁
  █▁▁▁▃▄▁▁▁▁▁▃▃▄▅▃▁▄▁▁▃▁▁▁▄▃▄▃▃▁▁▃▄▄▅▄▅▄▄▁▁▁▁▁▁▃▁▁▁▁▁▁▁▄▅▇ █
  1.3 μs       Histogram: log(frequency) by time     2.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.366 μs … 6.029 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.426 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.452 μs ± 157.941 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▂                                                ▁▁  ▂
  ██████▄▅▁▃▃▃▃▁▅▆▅▄▃▁▃▄▁▁▄▁▁▁▁▁▁▃▁▃▁▁▁▃▁▁▃▆▆▅▃▄▃▄▁▁▃▅████ █
  2.37 μs      Histogram: log(frequency) by time      3.27 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.937 μs … 25.047 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.987 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.108 μs ± 680.484 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                               ▁▁         ▂
  █▄█▇▁▁▁▄▁▁▅█▅▄▄▁▄▁▁▃▃▁▃▃▁▁▁▁▁▄▁▁▃▁▁▁▁▃▁▁▁▁▁▄▆▇███▇▅▅▅▄▃▄ █
  8.94 μs      Histogram: log(frequency) by time        12 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
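
The key observation is that a 5×N matrix of Float64 values and a length-N vector of Vec5{Float64} describe the same memory layout, so reinterpret(reshape, ...) converts between them without copying. A small self-contained sketch (values are illustrative only):

A = randn(5, 4)
v5 = reinterpret(reshape, Vec5{Float64}, A)  # length-4 vector of Vec5, no copy
B  = reinterpret(reshape, Float64, v5)       # back to a 5×4 matrix view
B == A  # true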

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 555 evaluations per sample.
 Range (min … max):  207.486 ns … 284.766 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     209.437 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.186 ns ±   4.927 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▇█▇▆▄▃▁                                  ▁▃▄▄▃▂▁         ▂
  ▅▃▇████████▄▃▅▇▇▇▆▆▅▃▃▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▃▁▁▁▁▅▇████████▇▇▅▇▇▇▆ █
  207 ns        Histogram: log(frequency) by time        226 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 186 evaluations per sample.
 Range (min … max):  549.898 ns … 1.046 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     566.435 ns               GC (median):    0.00%
 Time  (mean ± σ):   571.236 ns ± 17.448 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▁▆██▃▁                                           
  ▂▁▁▂▂▂▂▂▃▄▆██████▇▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▂▂▂▁▂▂▂▂▂▂▃▃▃▄▃▃▃▃▂▂▂▂▂▂ ▃
  550 ns          Histogram: frequency by time          616 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.096 μs … 18.677 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.158 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.264 μs ± 675.023 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                            ▁▁           ▁
  ██▄▃▃▃▁▃▃▅▇▆▄▅▁▁▁▁▁▁▄▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▁▁▁▃▄▆███▇▅▅▅▅▄▅▄▄▅ █
  7.1 μs       Histogram: log(frequency) by time       9.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 570 evaluations per sample.
 Range (min … max):  204.979 ns … 381.539 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     207.668 ns                GC (median):    0.00%
 Time  (mean ± σ):   209.422 ns ±   5.428 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▄█▇                                                
  ▂▂▂▂▃▇████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▂▃▃▃▄▃▃▃▂▂▂▂▂▂▂ ▃
  205 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations per sample.
 Range (min … max):  532.106 ns … 1.160 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     545.095 ns               GC (median):    0.00%
 Time  (mean ± σ):   550.297 ns ± 20.351 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▇█                                            
  ▂▁▁▂▂▂▂▄▆█████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  532 ns          Histogram: frequency by time          596 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.281 μs … 21.285 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.349 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.441 μs ± 561.668 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃█                                                 ▁▁▁   ▂
  ███▄▃▅▄▁▁▁▁▃▁▅▆▆▅▃▁▄▃▁▁▁▁▃▁▃▁▁▁▄▁▁▁▁▁▁▁▁▁▅▇▁▁▁▁▁▃▃▅▇████▇ █
  7.28 μs      Histogram: log(frequency) by time      9.33 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 446 evaluations per sample.
 Range (min … max):  230.769 ns … 364.022 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     234.070 ns                GC (median):    0.00%
 Time  (mean ± σ):   235.943 ns ±   6.032 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▄▅█▄▃                                                   
  ▂▂▃▆███████▆▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▂▂▁▂▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂ ▃
  231 ns           Histogram: frequency by time          255 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  214.311 μs … 5.806 ms   GC (min … max):  0.00% … 95.47%
 Time  (median):     225.767 μs                GC (median):     0.00%
 Time  (mean ± σ):   256.195 μs ± 361.926 μs   GC (mean ± σ):  10.24% ±  6.91%

   ▅█                                                  
  ▃██▇█▇▅▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▁▂▂▁▂▁▁▂▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▁▂▂ ▃
  214 μs           Histogram: frequency by time          392 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  173.524 μs … 5.427 ms   GC (min … max):  0.00% … 95.35%
 Time  (median):     181.078 μs                GC (median):     0.00%
 Time  (mean ± σ):   208.796 μs ± 358.209 μs   GC (mean ± σ):  12.46% ±  6.97%

   ▂▃                                                          
  ▅██▄▄▄▅▃▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▂▁▂▂▂▂▁▂▁▁▁▂▁▁▂▁▁▁▁▁▁▂▂▁▁▁▂▁▁▁▁▂▂▂ ▂
  174 μs           Histogram: frequency by time          312 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
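
The large memory estimates and allocation counts above suggest that mul! with a StructArray does not dispatch to the specialized sparse and dense kernels on this Julia version but falls back to a generic, allocating code path; this is our reading of the numbers, not a statement from the package documentation. One way to inspect the dispatch is @which from InteractiveUtils, e.g.

using InteractiveUtils
@which mul!(du_soa, D_sparse, u_soa)  # reveals whether a generic fallback is hit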

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.83 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`