Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are executed automatically on cloud virtual machines during the documentation build, so absolute timings may vary.
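
If you want to check the numbers on your own hardware, the benchmarks are plain Julia scripts. Here is a minimal sketch of how to reproduce them locally (assuming you install the packages into a throwaway environment):

using Pkg
Pkg.activate(temp=true)  # temporary environment just for the benchmarks
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])
# afterwards, copy the code snippets below into the REPL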

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
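
The doit helper benchmarks the in-place mul! with interpolated arguments, so the operators themselves should not allocate. For reference, benchmarking the allocating product instead would report one extra output vector per call; a small sketch using the same setup:

# allocating matrix-vector product, for comparison with the in-place mul! used below
show(stdout, MIME"text/plain"(), @benchmark $D_SBP * $u)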

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.737 ns … 51.885 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.482 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.683 ns ±  1.174 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▇██▂▁                                                  
  ▂▆██████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▁▂▂▂▂▂▂ ▃
  28.7 ns         Histogram: frequency by time        36.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 520 evaluations.
 Range (min … max):  204.285 ns … 518.408 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     238.079 ns                GC (median):    0.00%
 Time  (mean ± σ):   241.989 ns ±  27.193 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▃▆█▅▄▃▂                                                ▂
  ▅▆▇███████████▇▇▅▅▅▄▄▄▁▁▃▃▃▁▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▅▆▄▆ █
  204 ns        Histogram: log(frequency) by time        452 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
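
The sparse matrix stores only a few entries per row, so the slowdown presumably comes from the indirect indexing of the CSC format rather than from additional floating point work. A quick check of the sparsity pattern (a sketch, assuming the setup above):

@show nnz(D_sparse)                     # number of stored entries (nnz is exported by SparseArrays)
@show nnz(D_sparse) / length(D_sparse)  # fill ratio of the 100×100 matrix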

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 98 evaluations.
 Range (min … max):  793.520 ns … 79.973 μs   GC (min … max): 0.00% … 98.38%
 Time  (median):     812.429 ns               GC (median):    0.00%
 Time  (mean ± σ):   864.171 ns ±  1.748 μs   GC (mean ± σ):  4.50% ±  2.20%

     ▁▇█▆                                                     
  ▁▂▄████▆▄▃▃▂▂▂▂▂▂▁▁▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  794 ns          Histogram: frequency by time          967 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  383.030 ns … 683.153 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.862 ns                GC (median):    0.00%
 Time  (mean ± σ):   389.252 ns ±   8.708 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▄▇██▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. No surprise: as on periodic domains, it is again much slower.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.669 μs … 10.133 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.702 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.721 μs ± 169.574 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇                                                        ▂
  ██▁▄▆▇█▇▅▄▆▆▆▆▅▅▅▅▆▆▆▅▅▄▃▄▄▃▃▄▄▃▃▃▁▁▃▃▁▁▄▁▃▁▁▁▃▁▁▁▃▁▁▁▄▆ █
  4.67 μs      Histogram: log(frequency) by time      5.54 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.638 μs … 13.042 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.658 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.690 μs ± 291.639 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▁                                                     ▁
  █▆▆██▅▅▆▆▆▅▄▄▅▆▅▆▅▄▄▄▆▃▅▅▁▅▃▄▄▁▁▁▄▄▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▄ █
  6.64 μs      Histogram: log(frequency) by time      8.23 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
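
For context, the banded representation is quite compact; its bandwidths are mainly determined by the wide boundary closures of the SBP operator and can be inspected directly (a sketch, assuming the setup above):

@show bandwidths(D_banded)  # (lower, upper) bandwidths; bandwidths is exported by BandedMatrices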

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  383.374 ns … 523.690 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.719 ns                GC (median):    0.00%
 Time  (mean ± σ):   388.901 ns ±   7.039 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▃▇██▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▂▂▁▁▂▂▁▂▁▁▁▂▁▂▂▂▁▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          427 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.013 μs … 3.841 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.021 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.027 μs ± 71.831 ns   GC (mean ± σ):  0.00% ± 0.00%

      █                                                  
  ▂▂▆███▄▅▅▃▃▂▂▂▁▂▂▁▂▁▂▁▁▁▂▁▁▂▂▁▁▁▁▂▁▂▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▂▁▂▁▂ ▃
  1.01 μs        Histogram: frequency by time         1.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.378 μs … 11.074 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.428 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.452 μs ± 196.956 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█    ▁                                                  ▂
  ███▄▄▅▇██▅▁▅▆▅▅▄▅▅▆▅▅▅▄▅▅▅▃▃▄▅▃▁▁▁▃▁▁▁▃▁▁▁▄▁▃▁▁▁▃▁▁▁▁▁▁▅▆ █
  5.38 μs      Histogram: log(frequency) by time      6.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.173 μs … 11.301 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.191 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.221 μs ± 210.990 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                      ▁                   ▂
  █▄▄▅▆███▆▄▄▅▆▆▅▄▃▅▆▅▅▅▆▄▄▅▅▄▄▁▁▃▅▄▆▅██▁▄▁▃▃▁▁▁▃▅▁▄▁▁▃▁▃▄ █
  6.17 μs      Histogram: log(frequency) by time      7.14 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  132.968 μs … 302.744 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     134.500 μs                GC (median):    0.00%
 Time  (mean ± σ):   135.798 μs ±   5.110 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█▄▁ ▁▁               ▁▁                                   ▁
  ▆████████▇▇▆▇▆▆▅▅▅▆▅▅▇█████▇▇▇██▆▇▅▆▄▅▅▅▅▄▄▅▆▅▆▅▅▅▄▄▄▄▅▅▅▅▄ █
  133 μs        Histogram: log(frequency) by time        158 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
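
Part of the problem is sheer size: for N = 10^3 grid points, the dense matrix stores 10^6 Float64 entries (roughly 8 MB), although the operator only has a handful of nonzero coefficients per row. A back-of-the-envelope check (a sketch, assuming the setup above):

N = size(Di_full, 1)
@show N^2 * sizeof(Float64) / 2^20        # dense storage in MiB
@show Base.summarysize(Di_sparse) / 2^10  # sparse storage in KiB, for comparison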

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both data layouts, e.g., plain arrays, arrays of StaticArrays.jl vectors (AoS), and StructArrays.jl containers (SoA).
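
To see the difference in memory layout, here is a tiny sketch (using a hypothetical toy type Vec2, independent of the benchmarks below): an Array of FieldVectors stores the components of each entry next to each other (AoS), while a StructArray stores one plain vector per component (SoA).

using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2,T}
  x1::T
  x2::T
end

v_aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]  # components of each entry interleaved in memory
v_soa = StructArray(v_aos)                # one contiguous vector per field
@show v_soa.x1  # [1.0, 3.0]
@show v_soa.x2  # [2.0, 4.0]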

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operator implemented in SummationByPartsOperators.jl and its representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  44.958 ns … 95.830 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.001 ns               GC (median):    0.00%
 Time  (mean ± σ):   49.229 ns ±  9.230 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█▄▁                                                  ▂▃▂▁ ▁
  ████▇▇▇▆▅▅▅▆▇▇▅▆▅▇█▇▆▅▅▄▆▄▅▅▄▂▅▃▄▄▅▄▄▅▃▅▄▅▄▄▅▅▅▅▆▆▆▆▇████ █
  45 ns        Histogram: log(frequency) by time      79.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations.
 Range (min … max):  334.154 ns … 640.018 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     341.181 ns                GC (median):    0.00%
 Time  (mean ± σ):   343.213 ns ±  11.424 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▄█▇                                                  
  ▂▂▂▂▃▇█████▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  334 ns           Histogram: frequency by time          381 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.724 μs … 3.821 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.735 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.744 μs ± 83.984 ns   GC (mean ± σ):  0.00% ± 0.00%

     █▃▁                                                    
  ▂▅████▅▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▁▁▂▂ ▃
  1.72 μs        Histogram: frequency by time        1.87 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.308 μs … 4.356 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.319 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.325 μs ± 72.449 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▁▅ ▇█                                     
  ▂▂▂▂▂▃▁▄▇▁██▁██▇██▁▇▆▁▆▁▅▅▂▃▃▃▁▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  1.31 μs        Histogram: frequency by time        1.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.346 μs … 5.484 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.408 μs               GC (median):    0.00%
 Time  (mean ± σ):   2.416 μs ± 91.220 ns   GC (mean ± σ):  0.00% ± 0.00%

                ▂▄▁▆▂▇█▇▄▄ ▂▁                              
  ▁▁▁▁▁▁▂▂▂▄▃▅▇▆█████████████▆▆▄▅▄▂▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  2.35 μs        Histogram: frequency by time        2.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.927 μs … 19.543 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.004 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.188 μs ± 799.665 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁                                                       ▂
  ██▆▆▆▃▆▇▅▅▁▃▁▁▃▃▄▁▃▄▅▃▃▃▁▃▃▃▁▆█▇▇▇█▇▆▇▇▇▇▇▆▆▆▅▆▆▆▅▆▆▅▆▆▇ █
  8.93 μs      Histogram: log(frequency) by time      13.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations.
 Range (min … max):  210.131 ns … 285.272 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.400 ns                GC (median):    0.00%
 Time  (mean ± σ):   215.063 ns ±   3.518 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▅█                                               
  ▂▂▂▂▂▃▃▄▇████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  210 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 138 evaluations.
 Range (min … max):  705.587 ns … 949.594 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     739.928 ns                GC (median):    0.00%
 Time  (mean ± σ):   742.318 ns ±  13.485 ns   GC (mean ± σ):  0.00% ± 0.00%

                   ▁▄▆█                                      
  ▂▁▁▁▁▁▁▁▂▁▁▁▂▂▃▃▆█████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  706 ns           Histogram: frequency by time          805 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.404 μs … 12.754 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.454 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.490 μs ± 295.457 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                         ▁
  █▃▆██▅▃▆▆▆▄▄▅▇▆▆▄▃▄▄▃▃▃▃▁▁▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆ █
  7.4 μs       Histogram: log(frequency) by time      9.42 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 560 evaluations.
 Range (min … max):  207.691 ns … 250.448 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.803 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.456 ns ±   3.163 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▄▆██▄▃▁                                    ▁         ▂
  ▅▅▅▆▇█████████▇▇████▅▆▄▁▄▃▄▃▃▄▃▃▁▃▄▄▃▅▅▃▃▅▃▁▄▅▆█████▇▇▇▆▅▆ █
  208 ns        Histogram: log(frequency) by time        227 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations.
 Range (min … max):  538.889 ns … 776.794 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     551.561 ns                GC (median):    0.00%
 Time  (mean ± σ):   553.552 ns ±  11.540 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▁▅▇█▂                                              
  ▂▂▂▂▃▄▅▇█████▇▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  539 ns           Histogram: frequency by time          601 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.361 μs … 13.398 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.416 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.449 μs ± 259.647 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█    ▁                                                    ▁
  ██▃▆██▆▅▄▇▅▅▃▆▅▆▁▃▃▅▄▃▁▁▄▁▃▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▃▁▃▁▃▃▁▄▅ █
  7.36 μs      Histogram: log(frequency) by time      9.39 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case; the large allocation counts reported below suggest they fall back to slow generic kernels.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 496 evaluations.
 Range (min … max):  221.260 ns … 303.327 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     223.704 ns                GC (median):    0.00%
 Time  (mean ± σ):   224.508 ns ±   4.277 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▅▇█                                                     
  ▂▃▅████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  221 ns           Histogram: frequency by time          244 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  212.996 μs … 7.433 ms   GC (min … max):  0.00% … 96.18%
 Time  (median):     222.344 μs                GC (median):     0.00%
 Time  (mean ± σ):   272.850 μs ± 442.655 μs   GC (mean ± σ):  10.38% ±  6.19%

   ▃▅█▄▃▃▃▂                                            ▃▄▄▂▁▁▁ ▂
  ▅█████████▇█▇█▇▇▆▇▅▄▄▃▅▄▅▅▄▁▃▁▄▃▄▅▃▃▃▃▄▃▃▃▁▄▁▃▃▄▃▁▁████████ █
  213 μs        Histogram: log(frequency) by time        362 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  176.409 μs … 9.593 ms   GC (min … max):  0.00% … 74.10%
 Time  (median):     183.642 μs                GC (median):     0.00%
 Time  (mean ± σ):   234.827 μs ± 461.820 μs   GC (mean ± σ):  12.48% ±  6.20%

  ▃▄█▅▃▃▃▂▁                                           ▃▄▄▃▂▁▁  ▂
  ██████████▇██▇▆▆▇▆▅▅▅▄▅▄▅▅▃▄▃▃▄▃▃▃▃▄▁▁▁▁▄▁▄▄▁▁▃▃▄▁▇████████ █
  176 μs        Histogram: log(frequency) by time        321 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.3
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`