Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are executed automatically on cloud virtual machines during the documentation build, so absolute timings may vary.
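
If you want to check the numbers on your own hardware, the benchmarks are plain Julia scripts. Here is a minimal sketch of how to reproduce them locally (assuming you install the packages into a throwaway environment):

using Pkg
Pkg.activate(temp=true)  # temporary environment just for the benchmarks
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])
# afterwards, copy the code snippets below into the REPL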

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
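
The doit helper benchmarks the in-place mul! with interpolated arguments, so the operators themselves should not allocate. For reference, benchmarking the allocating product instead would report one extra output vector per call; a small sketch using the same setup:

# allocating matrix-vector product, for comparison with the in-place mul! used below
show(stdout, MIME"text/plain"(), @benchmark $D_SBP * $u)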

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.737 ns … 51.885 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.482 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.683 ns ±  1.174 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▇██▂▁                                                  
  ▂▆██████▇▅▄▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▁▂▁▂▂▁▂▂▂▁▂▂▂▂▂▂ ▃
  28.7 ns         Histogram: frequency by time        36.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 520 evaluations.
 Range (min … max):  204.285 ns … 518.408 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     238.079 ns                GC (median):    0.00%
 Time  (mean ± σ):   241.989 ns ±  27.193 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▃▆█▅▄▃▂                                                ▂
  ▅▆▇███████████▇▇▅▅▅▄▄▄▁▁▃▃▃▁▃▁▁▁▃▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▅▆▄▆ █
  204 ns        Histogram: log(frequency) by time        452 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
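
The sparse matrix stores only a few entries per row, so the slowdown presumably comes from the indirect indexing of the CSC format rather than from additional floating point work. A quick check of the sparsity pattern (a sketch, assuming the setup above):

@show nnz(D_sparse)                     # number of stored entries (nnz is exported by SparseArrays)
@show nnz(D_sparse) / length(D_sparse)  # fill ratio of the 100×100 matrix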

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/1hqld/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 98 evaluations.
 Range (min … max):  793.520 ns … 79.973 μs   GC (min … max): 0.00% … 98.38%
 Time  (median):     812.429 ns               GC (median):    0.00%
 Time  (mean ± σ):   864.171 ns ±  1.748 μs   GC (mean ± σ):  4.50% ±  2.20%

     ▁▇█▆                                                     
  ▁▂▄████▆▄▃▃▂▂▂▂▂▂▁▁▂▂▂▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  794 ns          Histogram: frequency by time          967 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  383.030 ns … 683.153 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.862 ns                GC (median):    0.00%
 Time  (mean ± σ):   389.252 ns ±   8.708 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▄▇██▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          430 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. No surprise: as on periodic domains, it is again much slower.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.669 μs … 10.133 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.702 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.721 μs ± 169.574 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇                                                        ▂
  ██▁▄▆▇█▇▅▄▆▆▆▆▅▅▅▅▆▆▆▅▅▄▃▄▄▃▃▄▄▃▃▃▁▁▃▃▁▁▄▁▃▁▁▁▃▁▁▁▃▁▁▁▄▆ █
  4.67 μs      Histogram: log(frequency) by time      5.54 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.638 μs … 13.042 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.658 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.690 μs ± 291.639 ns   GC (mean ± σ):  0.00% ± 0.00%

  █    ▁                                                     ▁
  █▆▆██▅▅▆▆▆▅▄▄▅▆▅▆▅▄▄▄▆▃▅▅▁▅▃▄▄▁▁▁▄▄▁▄▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▄ █
  6.64 μs      Histogram: log(frequency) by time      8.23 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
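
For context, the banded representation is quite compact; its bandwidths are mainly determined by the wide boundary closures of the SBP operator and can be inspected directly (a sketch, assuming the setup above):

@show bandwidths(D_banded)  # (lower, upper) bandwidths; bandwidths is exported by BandedMatrices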

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations.
 Range (min … max):  383.374 ns … 523.690 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.719 ns                GC (median):    0.00%
 Time  (mean ± σ):   388.901 ns ±   7.039 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▄█                                                      
  ▂▂▃▇██▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▁▁▁▁▁▂▂▁▁▂▂▁▂▁▁▁▂▁▂▂▂▁▂▂▂▂▂▂▂▂ ▃
  383 ns           Histogram: frequency by time          427 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.013 μs … 3.841 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.021 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.027 μs ± 71.831 ns   GC (mean ± σ):  0.00% ± 0.00%

      █                                                  
  ▂▂▆███▄▅▅▃▃▂▂▂▁▂▂▁▂▁▂▁▁▁▂▁▁▂▂▁▁▁▁▂▁▂▂▂▁▂▂▁▂▂▂▂▁▂▂▂▂▂▁▂▁▂ ▃
  1.01 μs        Histogram: frequency by time         1.1 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. Unsurprisingly, these are again much slower, by around an order of magnitude.

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.378 μs … 11.074 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.428 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.452 μs ± 196.956 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▆█    ▁                                                  ▂
  ███▄▄▅▇██▅▁▅▆▅▅▄▅▅▆▅▅▅▄▅▅▅▃▃▄▅▃▁▁▁▃▁▁▁▃▁▁▁▄▁▃▁▁▁▃▁▁▁▁▁▁▅▆ █
  5.38 μs      Histogram: log(frequency) by time      6.46 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.173 μs … 11.301 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.191 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.221 μs ± 210.990 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                      ▁                   ▂
  █▄▄▅▆███▆▄▄▅▆▆▅▄▃▅▆▅▅▅▆▄▄▅▅▄▄▁▁▃▅▄▆▅██▁▄▁▃▃▁▁▁▃▅▁▄▁▁▃▁▃▄ █
  6.17 μs      Histogram: log(frequency) by time      7.14 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  132.968 μs … 302.744 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     134.500 μs                GC (median):    0.00%
 Time  (mean ± σ):   135.798 μs ±   5.110 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▃▇█▄▁ ▁▁               ▁▁                                   ▁
  ▆████████▇▇▆▇▆▆▅▅▅▆▅▅▇█████▇▇▇██▆▇▅▆▄▅▅▅▅▄▄▅▆▅▆▅▅▅▄▄▄▄▅▅▅▅▄ █
  133 μs        Histogram: log(frequency) by time        158 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
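
Part of the problem is sheer size: for N = 10^3 grid points, the dense matrix stores 10^6 Float64 entries (roughly 8 MB), although the operator only has a handful of nonzero coefficients per row. A back-of-the-envelope check (a sketch, assuming the setup above):

N = size(Di_full, 1)
@show N^2 * sizeof(Float64) / 2^20        # dense storage in MiB
@show Base.summarysize(Di_sparse) / 2^10  # sparse storage in KiB, for comparison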

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of both data layouts, e.g., plain arrays, arrays of StaticArrays.jl vectors (AoS), and StructArrays.jl containers (SoA).
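
To see the difference in memory layout, here is a tiny sketch (using a hypothetical toy type Vec2, independent of the benchmarks below): an Array of FieldVectors stores the components of each entry next to each other (AoS), while a StructArray stores one plain vector per component (SoA).

using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2,T}
  x1::T
  x2::T
end

v_aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]  # components of each entry interleaved in memory
v_soa = StructArray(v_aos)                # one contiguous vector per field
@show v_soa.x1  # [1.0, 3.0]
@show v_soa.x2  # [2.0, 4.0]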

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operator implemented in SummationByPartsOperators.jl and its representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  44.958 ns … 95.830 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.001 ns               GC (median):    0.00%
 Time  (mean ± σ):   49.229 ns ±  9.230 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆█▄▁                                                  ▂▃▂▁ ▁
  ████▇▇▇▆▅▅▅▆▇▇▅▆▅▇█▇▆▅▅▄▆▄▅▅▄▂▅▃▄▄▅▄▄▅▃▅▄▅▄▄▅▅▅▅▆▆▆▆▇████ █
  45 ns        Histogram: log(frequency) by time      79.4 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 221 evaluations.
 Range (min … max):  334.154 ns … 640.018 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     341.181 ns                GC (median):    0.00%
 Time  (mean ± σ):   343.213 ns ±  11.424 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▄█▇                                                  
  ▂▂▂▂▃▇█████▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂ ▃
  334 ns           Histogram: frequency by time          381 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.724 μs … 3.821 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.735 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.744 μs ± 83.984 ns   GC (mean ± σ):  0.00% ± 0.00%

     █▃▁                                                    
  ▂▅████▅▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▁▂▂▁▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▁▁▂▂ ▃
  1.72 μs        Histogram: frequency by time        1.87 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.308 μs … 4.356 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.319 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.325 μs ± 72.449 ns   GC (mean ± σ):  0.00% ± 0.00%

            ▁▅ ▇█                                     
  ▂▂▂▂▂▃▁▄▇▁██▁██▇██▁▇▆▁▆▁▅▅▂▃▃▃▁▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂ ▃
  1.31 μs        Histogram: frequency by time        1.35 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.346 μs … 5.484 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.408 μs               GC (median):    0.00%
 Time  (mean ± σ):   2.416 μs ± 91.220 ns   GC (mean ± σ):  0.00% ± 0.00%

                ▂▄▁▆▂▇█▇▄▄ ▂▁                              
  ▁▁▁▁▁▁▂▂▂▄▃▅▇▆█████████████▆▆▄▅▄▂▃▂▂▂▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  2.35 μs        Histogram: frequency by time        2.51 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.927 μs … 19.543 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.004 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.188 μs ± 799.665 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▁                                                       ▂
  ██▆▆▆▃▆▇▅▅▁▃▁▁▃▃▄▁▃▄▅▃▃▃▁▃▃▃▁▆█▇▇▇█▇▆▇▇▇▇▇▆▆▆▅▆▆▆▅▆▆▅▆▆▇ █
  8.93 μs      Histogram: log(frequency) by time      13.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations.
 Range (min … max):  210.131 ns … 285.272 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.400 ns                GC (median):    0.00%
 Time  (mean ± σ):   215.063 ns ±   3.518 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▂▅█                                               
  ▂▂▂▂▂▃▃▄▇████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▂▂▂▂▂▁▂▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  210 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 138 evaluations.
 Range (min … max):  705.587 ns … 949.594 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     739.928 ns                GC (median):    0.00%
 Time  (mean ± σ):   742.318 ns ±  13.485 ns   GC (mean ± σ):  0.00% ± 0.00%

                   ▁▄▆█                                      
  ▂▁▁▁▁▁▁▁▂▁▁▁▂▂▃▃▆█████▆▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  706 ns           Histogram: frequency by time          805 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.404 μs … 12.754 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.454 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.490 μs ± 295.457 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                         ▁
  █▃▆██▅▃▆▆▆▄▄▅▇▆▆▄▃▄▄▃▃▃▃▁▁▃▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▆ █
  7.4 μs       Histogram: log(frequency) by time      9.42 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 560 evaluations.
 Range (min … max):  207.691 ns … 250.448 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.803 ns                GC (median):    0.00%
 Time  (mean ± σ):   211.456 ns ±   3.163 ns   GC (mean ± σ):  0.00% ± 0.00%

       ▁▄▆██▄▃▁                                    ▁         ▂
  ▅▅▅▆▇█████████▇▇████▅▆▄▁▄▃▄▃▃▄▃▃▁▃▄▄▃▅▅▃▃▅▃▁▄▅▆█████▇▇▇▆▅▆ █
  208 ns        Histogram: log(frequency) by time        227 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 189 evaluations.
 Range (min … max):  538.889 ns … 776.794 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     551.561 ns                GC (median):    0.00%
 Time  (mean ± σ):   553.552 ns ±  11.540 ns   GC (mean ± σ):  0.00% ± 0.00%

          ▁▅▇█▂                                              
  ▂▂▂▂▃▄▅▇█████▇▆▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  539 ns           Histogram: frequency by time          601 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.361 μs … 13.398 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.416 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.449 μs ± 259.647 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█    ▁                                                    ▁
  ██▃▆██▆▅▄▇▅▅▃▆▅▆▁▃▃▅▄▃▁▁▄▁▃▁▁▃▁▁▃▁▁▁▁▁▁▃▁▁▁▁▁▃▁▁▃▁▃▁▃▃▁▄▅ █
  7.36 μs      Histogram: log(frequency) by time      9.39 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case; the large allocation counts reported below suggest they fall back to slow generic kernels.

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 496 evaluations.
 Range (min … max):  221.260 ns … 303.327 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     223.704 ns                GC (median):    0.00%
 Time  (mean ± σ):   224.508 ns ±   4.277 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▁▅▇█                                                     
  ▂▃▅████▅▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▂▁▁▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  221 ns           Histogram: frequency by time          244 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  212.996 μs … 7.433 ms   GC (min … max):  0.00% … 96.18%
 Time  (median):     222.344 μs                GC (median):     0.00%
 Time  (mean ± σ):   272.850 μs ± 442.655 μs   GC (mean ± σ):  10.38% ±  6.19%

   ▃▅█▄▃▃▃▂                                            ▃▄▄▂▁▁▁ ▂
  ▅█████████▇█▇█▇▇▆▇▅▄▄▃▅▄▅▅▄▁▃▁▄▃▄▅▃▃▃▃▄▃▃▃▁▄▁▃▃▄▃▁▁████████ █
  213 μs        Histogram: log(frequency) by time        362 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  176.409 μs … 9.593 ms   GC (min … max):  0.00% … 74.10%
 Time  (median):     183.642 μs                GC (median):     0.00%
 Time  (mean ± σ):   234.827 μs ± 461.820 μs   GC (mean ± σ):  12.48% ±  6.20%

  ▃▄█▅▃▃▃▂▁                                           ▃▄▄▃▂▁▁  ▂
  ██████████▇██▇▆▆▇▆▅▅▅▄▅▄▅▅▃▄▃▃▄▃▃▃▃▄▁▁▁▁▄▁▄▄▁▁▃▃▄▁▇████████ █
  176 μs        Histogram: log(frequency) by time        321 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.3
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.60 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`