Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they run on virtual machines in the cloud as part of the automated documentation build, so absolute timings can vary.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

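# first-derivative operator, second-order accurate, on a periodic grid with N = 100 nodes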
D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
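  # interpolate the arguments with $ so that BenchmarkTools does not measure
  # the overhead of accessing non-constant global variables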
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations per sample.
 Range (min … max):  28.868 ns … 90.511 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.905 ns               GC (median):    0.00%
 Time  (mean ± σ):   30.169 ns ±  1.709 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▃█                                                   
  ▃▅▆▄▄██▅▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▂▁▁▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂ ▃
  28.9 ns         Histogram: frequency by time        37.7 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and software environment, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 600 evaluations per sample.
 Range (min … max):  198.788 ns … 426.982 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     200.743 ns                GC (median):    0.00%
 Time  (mean ± σ):   203.144 ns ±   8.664 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▂█▅                                                         
  ▃████▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂ ▃
  199 ns           Histogram: frequency by time          226 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
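
To see where this difference can come from, consider a minimal sketch of a second-order periodic central-difference kernel. This is a simplified model written for illustration, not the actual implementation in SummationByPartsOperators.jl: the point is that the stencil coefficients are compile-time constants and no index arrays have to be loaded from memory, in contrast to a generic sparse matrix-vector product, which must traverse stored indices and values.

# Simplified stencil kernel for illustration only (not the actual
# implementation in SummationByPartsOperators.jl). For the second-order
# periodic operator above, it should reproduce D_SBP * u with
# Δx = (xmax - xmin) / 100.
function mul_central_periodic!(du, u, Δx)
  N = length(u)
  inv_2Δx = inv(2 * Δx)
  @inbounds for i in eachindex(du, u)
    ip = i == N ? 1 : i + 1  # right neighbor with periodic wrap-around
    im = i == 1 ? N : i - 1  # left neighbor with periodic wrap-around
    du[i] = (u[ip] - u[im]) * inv_2Δx
  end
  return du
end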

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.82-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

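# sixth-order accurate SBP first-derivative operator with coefficients from Mattsson & Nordström (2004)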
D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  378.394 ns … 600.335 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     384.714 ns                GC (median):    0.00%
 Time  (mean ± σ):   388.220 ns ±  12.479 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▂▅▇██▇▅▄▂                                   ▁▂▃▃▂▂▁      ▂
  ▃▃▆██████████▆▆▅▅▆▇█▆▆▅▅▅▄▃▃▁▃▁▁▃▁▃▁▁▃▁▁▁▁▄▅▁▆█████████▇▇▇▅ █
  378 ns        Histogram: log(frequency) by time        428 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is much slower, just as for periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.686 μs … 11.407 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.723 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.785 μs ± 345.399 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█   ▂                                           ▁         ▁
  █████▅▅▆▅▆▅▆▆▇▄▄▁▄▄▃▃▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▃▇████▇▇▆▆▄▅ █
  4.69 μs      Histogram: log(frequency) by time      5.99 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.676 μs … 16.425 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.707 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.778 μs ± 394.383 ns   GC (mean ± σ):  0.00% ± 0.00%

  █   ▂▁                                                     ▂
  █▇██▁▅▆▅▅▄▆▅▅▆▅▆▃▃▄▃▃▁▃▃▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▇▇██▇▇▇▇▆▆▆ █
  6.68 μs      Histogram: log(frequency) by time      8.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
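
One contributing factor may be that the boundary closures of the SBP operator force the banded representation to use wider bands than the interior stencil alone would need, so that many stored entries are zero. This can be inspected as follows (a small snippet reusing the setup above; bandwidths is provided by BandedMatrices.jl and nnz by SparseArrays.jl).

# (lower, upper) bandwidths stored by the banded representation
@show bandwidths(D_banded)
# number of stored entries in the sparse representation
@show nnz(D_sparse)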

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.82-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before and start by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
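# dissipation operator matching the grid and norm of D_SBP, with coefficients from Mattsson, Svärd & Nordström (2004)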
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations per sample.
 Range (min … max):  395.842 ns … 640.109 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     402.238 ns                GC (median):    0.00%
 Time  (mean ± σ):   405.911 ns ±  13.522 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▃▆▇▅▂                                                    
  ▂▃▄▇█████▇▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▂▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  396 ns           Histogram: frequency by time          449 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.265 μs … 4.610 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.385 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.397 μs ± 109.469 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▄▇                                                  ▂
  ▄▁▁▄▅███▆▆▇█▇▄▄▄▄▄▄▃▄▄▄▃▃▅▃▁▃▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆▇ █
  1.27 μs      Histogram: log(frequency) by time      2.12 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again several times slower.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.260 μs … 14.425 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.295 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.361 μs ± 388.696 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆  ▁                                              ▁      ▁
  ██▅▇██▄▆▆▆▄▅▅▅▅▅▆▅▅▄▃▁▃▄▁▁▁▃▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▄▇████▇▆▆ █
  5.26 μs      Histogram: log(frequency) by time      6.67 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.252 μs … 17.262 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.282 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.460 μs ± 467.177 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁▁            ▆                                         ▁
  █▆▆██▆▆▅█▅▇▆▆▅▃▅██▅▅▇▅▃▁▅▄▄▃▃▄▆▁▃▃▃▁▁▁▅▇██▆▆▆▆▆▆▅▅▅▅▅▅▅▅▆ █
  6.25 μs      Histogram: log(frequency) by time      8.36 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the derivative operator. This is obviously a bad idea (a dense matrix-vector product costs O(N²) operations, whereas all the other representations need only O(N)), but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  136.726 μs … 408.775 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     140.282 μs                GC (median):    0.00%
 Time  (mean ± σ):   142.022 μs ±   5.737 μs   GC (mean ± σ):  0.00% ± 0.00%

        ▁▃▆█                                               
  ▁▁▂▂▄▆█████▇▅▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▃▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  137 μs           Histogram: frequency by time          158 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.82-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structured element types such as StaticVectors from StaticArrays.jl and of StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

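# static vector with five components, used as the element type of the arrays below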
struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

At first, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  46.629 ns … 77.313 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     47.288 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.748 ns ±  1.963 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆▇█▇▆▅▃▂▁                                      ▁▂▂▂▁▁    ▃
  ████████████▆▆▆▆▆▇▆▅▃▃▄▃▅▃▁▁▁▁▁▅▁▁▁▁▃▁▁▁▁▁▁▁▁▄▆████████▇▅ █
  46.6 ns      Histogram: log(frequency) by time      55.6 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (min … max):  326.048 ns … 701.882 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     333.741 ns                GC (median):    0.00%
 Time  (mean ± σ):   337.645 ns ±  16.053 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▆█▇▆▂▁                                                  
  ▂▂▃▅███████▇▄▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂ ▃
  326 ns           Histogram: frequency by time          380 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.712 μs … 5.300 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.725 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.746 μs ± 136.528 ns   GC (mean ± σ):  0.00% ± 0.00%

  █   ▁                            ▁                        ▂
  █▄▅█▇▅▄▅▅▅▄▅▄▃▅▅▄▅▄▃▄▃▄▁▁▁▃▁▁▃▁▁█▁▃▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▃▃▆▆▇▇ █
  1.71 μs      Histogram: log(frequency) by time      2.49 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.304 μs … 3.619 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.312 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.324 μs ± 97.996 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▃▄▃█▇▄▁▃▃▁▁▃▄▁▁▄▅▄▄▁▁▃▃▁▁▁▄▃▃▁▄▁▄▄▄▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▇ █
  1.3 μs       Histogram: log(frequency) by time     2.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.367 μs … 7.618 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.436 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.461 μs ± 162.498 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▆██▄▃▁                                               ▁  ▂
  █████████▇▆▃▁▁▁▄▄▅▅▄▄▄▁▃▄▁▃▁▁▃▃▁▃▃▁▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▄▇████ █
  2.37 μs      Histogram: log(frequency) by time      3.26 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.750 μs … 20.328 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.816 μs                GC (median):    0.00%
 Time  (mean ± σ):   8.958 μs ± 575.303 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇ ▆▅                                              ▁      ▂
  ██▅███▅▅▇▆▆▅▅▆▆▅▆▄▃▆▄▁▄▁▁▁▁▃▁▃▄▄▄▁▃▃▁▁▄▃▄▁▁▄▁▁▃▄▅████▇▆▆▆ █
  8.75 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, it is also more efficient for sparse matrices; for dense matrices, the timings are comparable to the plain AoS approach with mul_aos! above.

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 525 evaluations per sample.
 Range (min … max):  215.011 ns … 318.920 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     219.133 ns                GC (median):    0.00%
 Time  (mean ± σ):   221.040 ns ±   6.064 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▅█                                                
  ▂▂▂▃▃▅████▇▅▃▂▂▂▂▂▂▂▁▂▂▁▂▂▁▂▂▁▂▂▂▂▂▃▃▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  215 ns           Histogram: frequency by time          242 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 166 evaluations per sample.
 Range (min … max):  636.976 ns … 934.699 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     665.886 ns                GC (median):    0.00%
 Time  (mean ± σ):   670.983 ns ±  19.225 ns   GC (mean ± σ):  0.00% ± 0.00%

                 ▁▅▇█▅▂                                        
  ▁▁▁▁▁▁▁▂▂▂▃▃▄▅▇██████▆▄▃▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁ ▂
  637 ns           Histogram: frequency by time          729 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.166 μs … 22.129 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.213 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.308 μs ± 593.720 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆  ▁▁                                                    ▁
  ██▅▅██▆▄▅▅▅▅▅▅▅▆▆▅▅▁▃▃▄▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▄▅▇▇███▇ █
  7.17 μs      Histogram: log(frequency) by time       9.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference compared to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 550 evaluations per sample.
 Range (min … max):  209.100 ns … 293.513 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.530 ns                GC (median):    0.00%
 Time  (mean ± σ):   212.401 ns ±   5.532 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█▆▄▃▂                              ▂▃▄▃▂▁                 ▂
  ▅███████▆▃▄▇▇▇▆▄▅▃▄▃▃▁▁▁▃▁▁▃▁▄▄▄▁▁▁▅▇████████▆▆▆▇▇▆▆▆▅▅▆▅▄▅ █
  209 ns        Histogram: log(frequency) by time        231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 184 evaluations per sample.
 Range (min … max):  560.179 ns … 918.837 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     581.522 ns                GC (median):    0.00%
 Time  (mean ± σ):   586.625 ns ±  18.311 ns   GC (mean ± σ):  0.00% ± 0.00%

               ▂▆█▅▁                                           
  ▂▂▂▂▂▂▂▂▂▂▃▄▆█████▇▃▃▃▂▂▂▂▂▂▂▁▂▁▂▁▂▂▂▂▂▂▂▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  560 ns           Histogram: frequency by time          640 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.248 μs … 22.545 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.436 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.515 μs ± 534.144 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▃▅▆▇  ▁▁                                           ▁▁▁   ▂
  ▇████▇███▅▅▆▆▅▆▅▆▆▄▆▅▃▄▁▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▅▆█████▇ █
  7.25 μs      Histogram: log(frequency) by time      9.41 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case; the allocations reported below indicate that a generic, non-specialized fallback is used.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 473 evaluations per sample.
 Range (min … max):  225.241 ns … 314.023 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     227.677 ns                GC (median):    0.00%
 Time  (mean ± σ):   229.662 ns ±   6.165 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▇█                                                   
  ▂▂▅████▇▅▃▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▁▁▁▁▂▁▂▁▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  225 ns           Histogram: frequency by time          250 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  213.921 μs … 5.646 ms   GC (min … max):  0.00% … 93.16%
 Time  (median):     222.025 μs                GC (median):     0.00%
 Time  (mean ± σ):   255.271 μs ± 366.968 μs   GC (mean ± σ):  10.85% ±  7.18%

   ▅█▆▄▆▆▅▃▂▂▁▁▁▁▁                                             ▂
  ▇███████████████▇▇▆▆▅▆▄▅▄▃▁▃▅▁▄▅▄▃▄▃▁▁▃▁▃▃▁▁▁▁▁▁▁▁▁▁▁▄▆▇█▆▆ █
  214 μs        Histogram: log(frequency) by time        364 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  174.366 μs … 5.508 ms   GC (min … max):  0.00% … 95.64%
 Time  (median):     181.096 μs                GC (median):     0.00%
 Time  (mean ± σ):   213.398 μs ± 371.280 μs   GC (mean ± σ):  13.14% ±  7.25%

  ▂▆█▄▄▅▅▄▂▁▁                                                  ▂
  █████████████▇▇▆▅▆▄▃▃▁▃▃▃▅▃▄▁▄▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▆▆▇▇▅ █
  174 μs        Histogram: log(frequency) by time        321 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.82-DEV `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`