Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are executed on cloud virtual machines as part of the automated documentation build, so absolute timings can fluctuate.

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  30.540 ns … 52.372 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     31.568 ns               GC (median):    0.00%
 Time  (mean ± σ):   31.853 ns ±  1.444 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▆█                                                   
  ▂▂▂▄███▆▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▂▂▁▁▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂▂▂▂▂▂ ▃
  30.5 ns         Histogram: frequency by time          39 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl; a hand-written sketch of the kind of stencil loop the optimized implementation exploits follows the benchmark results below.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 646 evaluations per sample.
 Range (min … max):  191.054 ns … 416.351 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     201.026 ns                GC (median):    0.00%
 Time  (mean ± σ):   202.670 ns ±   6.518 ns   GC (mean ± σ):  0.00% ± 0.00%

                     ▄█▇                                       
  ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇████▆▅▃▂▂▂▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  191 ns           Histogram: frequency by time          219 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
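
To see where the speed difference comes from, here is a minimal sketch of the kind of computation such a periodic operator boils down to: a fixed central-difference stencil with wrap-around indexing and no matrix bookkeeping at all. This is an illustration under the assumption of a second-order stencil, not the actual implementation of SummationByPartsOperators.jl; the helper central_difference_periodic! is made up for this example.

# Hand-written second-order central difference with periodic wrap-around.
# Just an illustration of the stencil structure; NOT the actual
# implementation used by SummationByPartsOperators.jl.
function central_difference_periodic!(du, u, h)
  N = length(u)
  @inbounds for i in 1:N
    ip = i == N ? 1 : i + 1  # right neighbor, wrapping around at the end
    im = i == 1 ? N : i - 1  # left neighbor, wrapping around at the start
    du[i] = (u[ip] - u[im]) / (2 * h)
  end
  return du
end

h = x[2] - x[1]  # uniform spacing of the periodic grid
central_difference_periodic!(du, u, h)
@show du ≈ D_SBP * u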

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  371.882 ns … 608.576 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     380.562 ns                GC (median):    0.00%
 Time  (mean ± σ):   384.895 ns ±  13.312 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▄█▆▂  ▂                                             
  ▂▂▂▄▆████▆▃▃▅███▅▄▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▁▁▂▂▁▂▂▃▃▃▃▃▂▂▂▂▂▃▃▃▃▂▂▂ ▃
  372 ns           Histogram: frequency by time          426 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as on periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.454 μs … 7.939 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.478 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.524 μs ± 226.949 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                   ▁▁   ▁
  ██▃▁▆▇▃▁▄▃▁▃▃▆▅▄▅▄▃▄▁▁▁▃▁▁▄▁▁▁▃▁▁▁▃▁▄▄▁▄▁▅▇▁▃▁▄▃▃▄▁▅█████ █
  4.45 μs      Histogram: log(frequency) by time      5.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl; see the band structure inspected after the benchmark results below.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.675 μs … 40.652 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.703 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.799 μs ± 646.844 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▁▇▁▁▁▁▁▄▅▄▄▄▄▃▁▃▁▃▁▁▁▃▃▃▁▁▄▄▆▆▁▃▄▁▁▁▃▃████▇▆▆▅▅▆▄▄▅▅▆▅▆ █
  6.67 μs      Histogram: log(frequency) by time      8.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
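
One plausible reason, stated here as an assumption rather than a profiled fact, is that a generic banded kernel cannot exploit that almost all rows apply the identical interior stencil, while the specialized implementation can. We can at least inspect how wide the stored bands are; bandwidths is provided by BandedMatrices.jl, and the resulting numbers depend on the interior stencil and the boundary closures.

# Number of sub- and superdiagonals stored in the banded representation.
lower, upper = bandwidths(D_banded)
@show lower, upper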

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  379.921 ns … 564.650 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     386.335 ns                GC (median):    0.00%
 Time  (mean ± σ):   389.635 ns ±  11.605 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▆█                                                  
  ▂▂▂▃▅████▆▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▁▁▁▁▁▂▁▁▂▁▁▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  380 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.040 μs … 2.685 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.052 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.062 μs ± 82.468 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                        ▂
  █▃▄▃▄█▅▁▁▁▃▁▃▁▄▄▄▃▃▄▄▅▃▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▆ █
  1.04 μs      Histogram: log(frequency) by time     1.75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.275 μs … 9.221 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.310 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.360 μs ± 254.121 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█                                                   ▁▁   ▁
  ██▄▆█▇▁▁▁▁▁▅▆▄▆▄▃▄▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▅▁▁▁▁▁▃▃▁▁▃▁▁▁▁▅▇████ █
  5.27 μs      Histogram: log(frequency) by time      6.56 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.230 μs … 12.093 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.248 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.333 μs ± 357.525 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                   ▃                                      ▁
  █▁█▃▃▁▁▁▄▅▆▅▄▃▄▃▁▇█▃▃▃▁▄▁▁▁▁▁▃▃▃▁▁▆▁▃▁▁▁▁▄▄▇██▇▆▅▆▅▆▅▅▆▅ █
  6.23 μs      Histogram: log(frequency) by time      8.05 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation when a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷
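
A rough, back-of-the-envelope operation count shows what to expect: with N = 10^3 grid points, a dense matrix-vector product costs about N^2 = 10^6 multiply-adds, while the stencil-based operator only touches a fixed, small number of entries per node, i.e., on the order of 10^4 operations in total. A slowdown of roughly two orders of magnitude is therefore plausible, which matches the timings below.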

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  134.311 μs … 295.102 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     137.126 μs                GC (median):    0.00%
 Time  (mean ± σ):   138.942 μs ±   5.463 μs   GC (mean ± σ):  0.00% ± 0.00%

    ▂▄▆██▇▅▄▂▁          ▁▃▄▄▄▂▂▁    ▁                         ▂
  ▄▇███████████▇▇▄▄▅▄▅▅████████████████▇▇▆▅▆▄▅▂▅▅▄▄▄▃▅▅▅▅▄▅▄▆ █
  134 μs        Histogram: log(frequency) by time        158 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v1.7.6
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both memory layouts, e.g., for arrays of StaticVectors from StaticArrays.jl and for StructArrays.jl.
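
As a quick illustration of the two layouts, consider the following toy example; the type Point2D is made up for this illustration and is not part of the benchmark code below. An array of structures stores all components of one element next to each other in memory, while a structure of arrays keeps one contiguous array per component.

using StaticArrays, StructArrays

# Hypothetical two-component type, analogous to the Vec5 defined below.
struct Point2D{T} <: FieldVector{2,T}
  x::T
  y::T
end

aos = [Point2D(1.0, 2.0), Point2D(3.0, 4.0)]  # AoS: memory order 1.0, 2.0, 3.0, 4.0
soa = StructArray(aos)                        # SoA: one array per field
@show soa.x  # [1.0, 3.0]
@show soa.y  # [2.0, 4.0]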

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower by yet another factor of several.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 988 evaluations per sample.
 Range (min … max):  47.133 ns … 80.394 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     48.380 ns               GC (median):    0.00%
 Time  (mean ± σ):   48.847 ns ±  2.040 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▇▇█                                                  
  ▂▂▄▆████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▁▂▂▁▂▁▁▁▁▁▂▁▁▁▂▁▂▁▂▂▂▂▂▃▂▃▃▂▃▂▂▂▂▂ ▃
  47.1 ns         Histogram: frequency by time        56.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 233 evaluations per sample.
 Range (min … max):  318.708 ns … 558.124 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     326.103 ns                GC (median):    0.00%
 Time  (mean ± σ):   329.477 ns ±  12.183 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▅██▇                                                
  ▁▁▁▃▅███████▆▄▃▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▂
  319 ns           Histogram: frequency by time          365 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.715 μs … 3.782 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.724 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.744 μs ± 115.407 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▁
  █▅▄▄▇▆▃▁▁▁▃▃▄▄▅▃▁▄▁▁▇█▃▁▁▁▃▁▃▁▁▁▁▁▁▁▁▁▁▃▁▃▃▁▁▃▄▁▁▁▁▁▄▅▆██ █
  1.72 μs      Histogram: log(frequency) by time      2.48 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.308 μs … 3.357 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.319 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.330 μs ± 91.767 ns   GC (mean ± σ):  0.00% ± 0.00%

  █ ▂▁▁▂▂▂▁▁▁▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▁▁▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂ ▂
  1.31 μs        Histogram: frequency by time        2.02 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.638 μs … 5.416 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.738 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.762 μs ± 142.825 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▆                                                    
  ▂▂▃▅██▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▁▁▂▁▁▁▂▁▁▁▁▁▂▁▂▁▁▁▁▁▁▁▁▂▂▂▂▂▂ ▃
  2.64 μs         Histogram: frequency by time        3.55 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.920 μs … 24.499 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.043 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.176 μs ± 790.273 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅▇  ▁                                   ▁▁                ▂
  ██▇█▅▁▁▄▄▅▅▆▄▁▁▁▁▁▁▃▁▅▇▃▁▁▃▁▁▁▁▁▃▁▁▁▃▅████▇▅▄▄▅▄▄▄▃▄▅▄▄▄ █
  8.92 μs      Histogram: log(frequency) by time      12.4 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
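
As a reminder of what the reinterpret step does, here is a toy example with made-up data (the 5×2 matrix B exists only for this illustration): it views the columns of a matrix of scalars as a vector of Vec5 elements without copying anything.

# View a 5×2 matrix as a vector of two Vec5 elements; no data is copied.
B = reshape(collect(1.0:10.0), 5, 2)
vB = reinterpret(reshape, Vec5{Float64}, B)
@show vB[1]  # the first column of B, seen as one Vec5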

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 545 evaluations per sample.
 Range (min … max):  209.365 ns … 278.283 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     212.361 ns                GC (median):    0.00%
 Time  (mean ± σ):   214.091 ns ±   4.970 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▆█                                                
  ▂▂▂▂▃▄████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▂▁▁▂▁▁▁▂▂▂▂▂▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂ ▃
  209 ns           Histogram: frequency by time          230 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 180 evaluations per sample.
 Range (min … max):  585.489 ns … 892.506 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     602.294 ns                GC (median):    0.00%
 Time  (mean ± σ):   607.225 ns ±  15.923 ns   GC (mean ± σ):  0.00% ± 0.00%

             ▃▇█▆▃▁                                            
  ▂▁▁▁▁▁▂▂▃▄▆██████▇▅▄▃▃▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▁▁▁▂▂▂▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂ ▃
  585 ns           Histogram: frequency by time          654 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.256 μs … 17.372 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.439 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.519 μs ± 485.763 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁ ▁▇                                                ▁▁  ▂
  ▇█████▇▄▆▆▄▁▁▃▃▁▄▆▆▅▄▃▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▇▁▁▁▁▁▄▁▄▆████ █
  7.26 μs      Histogram: log(frequency) by time      9.28 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 560 evaluations per sample.
 Range (min … max):  205.920 ns … 279.738 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.463 ns                GC (median):    0.00%
 Time  (mean ± σ):   210.090 ns ±   4.679 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃██                                                    
  ▂▂▂▂▃▅████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▂▂▁▁▁▁▁▁▁▁▁▂▂▁▂▂▂▃▄▄▃▃▂▂▂▂▂▂▂ ▃
  206 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 181 evaluations per sample.
 Range (min … max):  581.530 ns … 909.050 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     593.597 ns                GC (median):    0.00%
 Time  (mean ± σ):   598.804 ns ±  17.184 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▅▇█▄▂                                                
  ▂▂▂▂▃▄▇███████▆▄▄▃▃▃▃▃▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  582 ns           Histogram: frequency by time          646 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.476 μs … 15.712 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.546 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.629 μs ± 440.820 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁                                             ▁▁▁    ▂
  ██▄▇█▇▁▁▁▃▃▄▄▆▆▅▅▁▁▁▁▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▁▃▁▁▁▁▃▃▄▅▇████▆▆ █
  7.48 μs      Histogram: log(frequency) by time      9.54 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case: the nonzero memory estimates and allocation counts reported below suggest that generic, allocating fallback kernels are used instead of a specialized, allocation-free loop.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 438 evaluations per sample.
 Range (min … max):  232.078 ns … 352.304 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     235.760 ns                GC (median):    0.00%
 Time  (mean ± σ):   237.593 ns ±   6.025 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▃▇█                                                
  ▂▂▂▃▅██████▆▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁▂▁▁▂▁▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂ ▃
  232 ns           Histogram: frequency by time          257 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  209.752 μs … 6.136 ms   GC (min … max):  0.00% … 93.84%
 Time  (median):     219.180 μs                GC (median):     0.00%
 Time  (mean ± σ):   252.502 μs ± 354.136 μs   GC (mean ± σ):  10.12% ±  6.90%

  ▃▆▅▆▃▂▁                                                      ▂
  ████████▇▅▅▆▄▅▅▅▅▄▄▄▁▄▁▁▄▃▁▁▄▃▆█▇█▇▇▆▆▅▅▄▅▄▄▄▄▃▁▄▅▄▅▅▅▅▆▆▅▆ █
  210 μs        Histogram: log(frequency) by time        462 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  176.501 μs … 5.440 ms   GC (min … max):  0.00% … 95.99%
 Time  (median):     184.715 μs                GC (median):     0.00%
 Time  (mean ± σ):   213.408 μs ± 353.301 μs   GC (mean ± σ):  11.99% ±  6.95%

  ▄▆▇█▄▄▅▃▁ ▁▁▂                                                ▂
  ██████████████▇▅▄▅▄▅▃▃▄▄▄▅▁▃▁▃▁▃▃▃▁▁▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▁▆▇█▇▆▆ █
  177 μs        Histogram: log(frequency) by time        319 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.15
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.85 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`