Benchmarks

Here are some simple benchmarks. Take them with a grain of salt, since they are run on virtual machines in the cloud as part of the automated documentation build.
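
If you want to reproduce the numbers on your own hardware, a minimal setup sketch could look as follows (this is just one way to do it; the package versions you obtain will of course differ from the ones recorded at the end of each subsection).

using Pkg
Pkg.activate(temp=true)  # throwaway environment just for these benchmarks
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])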

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  29.764 ns … 75.929 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     30.338 ns               GC (median):    0.00%
 Time  (mean ± σ):   30.660 ns ±  1.577 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆██▆▄▄▂▁▁                                          ▁▁▁  ▂
  ██████████▇▆▆▅▆▅▇▅▅▃▁▃▁▃▅▃▃▁▃▄▁▁▁▁▁▃▁▁▁▁▁▁▁▁▃▁▁▄▄▆▇█████ █
  29.8 ns      Histogram: log(frequency) by time      38.1 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained with a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 595 evaluations per sample.
 Range (min … max):  200.274 ns … 333.951 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     203.506 ns                GC (median):    0.00%
 Time  (mean ± σ):   205.559 ns ±   5.901 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃█▆▄▃                                                     
  ▁▂▄██████▇▆▅▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  200 ns           Histogram: frequency by time          225 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
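
Part of the gap seen above is due to memory traffic and indirection: the sparse matrix loads every stored value together with its column index, whereas the stencil-based operator only needs its few repeated stencil coefficients. A quick, purely illustrative way to get a feeling for this (not part of the benchmark itself; it only inspects the objects defined above) is

nnz(D_sparse)              # number of explicitly stored entries of the sparse matrix
Base.summarysize(D_sparse) # total bytes used by the sparse representation
Base.summarysize(D_SBP)    # total bytes used by the stencil-based operator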

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 109 evaluations per sample.
 Range (min … max):  768.404 ns … 68.071 μs   GC (min … max): 0.00% … 98.06%
 Time  (median):     787.615 ns               GC (median):    0.00%
 Time  (mean ± σ):   843.433 ns ±  1.609 μs   GC (mean ± σ):  4.65% ±  2.41%

   ▂▅▇██▆▅▄▃▂▂▁     ▂▂▂▁▁▂▂▂▁▁▂▃▃▃▂▁▁                        ▂
  ▆██████████████▇▇███████████████████▇▇▆▅▅▆▆▇▆▅▇█▆▆▆▇▇▆▆▅▆▅ █
  768 ns        Histogram: log(frequency) by time       956 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.74 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 203 evaluations per sample.
 Range (min … max):  382.493 ns … 997.384 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     387.276 ns                GC (median):    0.00%
 Time  (mean ± σ):   390.636 ns ±  12.952 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▅█                                                   
  ▂▂▄▇████▅▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▂▁▁▁▂▁▂▂▂▂▂▃▃▃▃▂▂▂▂▂▂ ▃
  382 ns           Histogram: frequency by time          429 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a representation of the derivative operator as a sparse matrix. Unsurprisingly, it is again much slower, just as in the periodic case.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations per sample.
 Range (min … max):  4.769 μs … 10.375 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.819 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.868 μs ± 269.947 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▇                                                ▁▁▁    ▁
  ███▇▄██▄▁▁▁▃▃▅▆▅▅▄▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▆▇████▇▆ █
  4.77 μs      Histogram: log(frequency) by time      5.96 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a representation as a banded matrix. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.640 μs … 13.768 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.658 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.721 μs ± 341.567 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                           ▁
  █▁█▇▃▄▁▄▁▄▅▆▅▄▄▃▆▆▅▄▃▃▃▃▁▁▁▁▄▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▄▇██▇▇▇▆▆▄▆▅ █
  6.64 μs      Histogram: log(frequency) by time      8.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
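
For context, the boundary closure of the SBP operator typically forces a wider band than the interior stencil needs, and the banded format then stores (and multiplies) the full band in every row. The band structure can be inspected directly; a small illustrative sketch using the bandwidths function from BandedMatrices.jl on the matrix constructed above:

l, u = bandwidths(D_banded)  # lower and upper bandwidth of the banded representation
(l, u), size(D_banded)       # a banded matrix-vector product touches about (l + u + 1) * N entries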

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.74 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before and start by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 201 evaluations per sample.
 Range (min … max):  394.970 ns … 579.040 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     402.144 ns                GC (median):    0.00%
 Time  (mean ± σ):   406.550 ns ±  14.852 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▃▇█                                                     
  ▂▃▆████▆▃▂▂▂▂▂▂▂▂▂▁▁▁▂▂▁▁▁▁▂▂▂▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  395 ns           Histogram: frequency by time          471 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.283 μs … 3.800 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.352 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.368 μs ± 114.191 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▁▅█   ▁▂▂▁                                              ▂
  ▅▄███▆▆▇█████▇▅▄▅▅▄▃▃▄▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▄▇▇ █
  1.28 μs      Histogram: log(frequency) by time      2.08 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower (by around an order of magnitude).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  5.235 μs … 13.462 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.265 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.318 μs ± 311.321 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                   ▁▁   ▁
  ██▁▄▇█▃▁▁▁▁▄▅▆▄▅▃▃▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▅▇████▆ █
  5.23 μs      Histogram: log(frequency) by time      6.57 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations per sample.
 Range (min … max):  6.194 μs … 11.972 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.214 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.377 μs ± 445.425 ns   GC (mean ± σ):  0.00% ± 0.00%

  █             ▄▄                                         ▁
  █▆▄█▁▁▄▁▄▅▅▅▃▃▁██▃▃▃▆▁▄▃▄▄▄▄▅▅▆▁▁▁▃▃▅▆▇█▇▇▆▆▅▆▅▅▅▅▅▄▅▆▆▆▇ █
  6.19 μs      Histogram: log(frequency) by time      8.34 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  137.928 μs … 356.177 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     139.962 μs                GC (median):    0.00%
 Time  (mean ± σ):   141.919 μs ±   5.570 μs   GC (mean ± σ):  0.00% ± 0.00%

  ▂▅▆▇█▇▅▃▁▁▁▁            ▂▃▄▅▄▃▂  ▁                          ▃
  ████████████▇▆▆▆▅▆▅▆▆▆▆█████████▇███▇▆▇▅▆▆▆▆▅▅▁▄▃▅▆▆█▇▆▆▆▇▆ █
  138 μs        Histogram: log(frequency) by time        162 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
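
The size of this gap is easy to estimate from memory traffic alone: a dense matrix-vector product has to read all N^2 = 10^6 Float64 entries (about 8 MB), while the banded and sparse versions only touch a handful of entries per row. A back-of-the-envelope check along these lines (a sketch, not part of the benchmark) is

N = size(Di_full, 1)
8 * N^2 / 2^20            # MiB of matrix data read per dense matrix-vector product (≈ 7.6)
8 * nnz(Di_sparse) / 2^10 # KiB of stored values for the sparse version; far less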

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.74 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of structured data layouts such as static vectors from StaticArrays.jl and StructArrays.jl.
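
The difference between the two layouts can be illustrated with a tiny, self-contained sketch (the two-component type Vec2 below is purely illustrative and is not used in the benchmarks):

using StaticArrays, StructArrays

struct Vec2{T} <: FieldVector{2,T}
  x::T
  y::T
end

# Array of structures (AoS): a single vector whose elements are Vec2 values,
# so the components of each entry are interleaved in memory.
v_aos = [Vec2(1.0, 2.0), Vec2(3.0, 4.0)]

# Structure of arrays (SoA): one plain vector per component, bundled by StructArrays.jl.
v_soa = StructArray(v_aos)
v_soa.x  # [1.0, 3.0]
v_soa.y  # [2.0, 4.0]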

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is far slower still.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations per sample.
 Range (min … max):  45.343 ns … 90.199 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     45.971 ns               GC (median):    0.00%
 Time  (mean ± σ):   46.417 ns ±  1.826 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▂▆██▇▆▅▄▃▂▁                                       ▁▂▂▁▁▁  ▃
  █████████████▇▇▆▇▆▅▅▅▃▃▃▁▁▁▁▁▄▆▆▄▃▁▃▁▅▁▁▁▁▁▁▁▁▄▅▇████████ █
  45.3 ns      Histogram: log(frequency) by time      53.9 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 228 evaluations per sample.
 Range (min … max):  325.610 ns … 599.281 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     332.289 ns                GC (median):    0.00%
 Time  (mean ± σ):   335.325 ns ±  11.724 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▁▄▆▇█▅▂▁                                                 
  ▂▂▃▅████████▅▄▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▁▂▂▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂ ▃
  326 ns           Histogram: frequency by time          370 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.715 μs … 6.071 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.726 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.871 μs ± 396.595 ns   GC (mean ± σ):  0.00% ± 0.00%

                                              ▁▃▃      ▁
  █▅▇▇▄▁▅▆▅▅▅█▄▅▅▄▄▄█▅▃▄▄▅▄▅▄▇▄▅▆█▇▆▆▆▆▆▆▇▅▅▆▆▆▆▄▅▇███▇▆▅▅▄ █
  1.72 μs      Histogram: log(frequency) by time      3.04 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array and our custom mul_aos! implementation that loops over each component, using mul! on views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.298 μs … 3.242 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.307 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.321 μs ± 101.154 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▄▄▄▇▆▁▁▁▁▁▃▁▃▄▅▄▄▁▃▁▄▃▁▄▅▃▃▃▄▄▄▅▁▃▄▁▁▁▁▄▁▃▄▃▁▃▄▃▁▄▁▃▃▅▆▇ █
  1.3 μs       Histogram: log(frequency) by time      2.03 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations per sample.
 Range (min … max):  2.351 μs … 5.509 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.404 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.430 μs ± 148.528 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▃▁                                                    ▂
  ██████▇▇▆▅▄▁▁▁▄▄▄▆▄▄▁▅▃▄▁▁▁▅▁▁▁▁▁▃▅▇▅▅▅▄▁▁▁▃▁▁▁▁▁▁▁▅▆████ █
  2.35 μs      Histogram: log(frequency) by time      3.22 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations per sample.
 Range (min … max):  8.937 μs … 15.863 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     8.993 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.091 μs ± 513.941 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                         ▁
  █▄▄██▄▃▁▃▁▁▄▅▆▆▅▄▄▁▁▄▁▃▃▃▄▁▁▁▃▃▁▁▁▁▁▁▃▃▁▁▁▃▁▁▁▃▁▁▇█▇████ █
  8.94 μs      Histogram: log(frequency) by time      11.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 570 evaluations per sample.
 Range (min … max):  205.121 ns … 281.175 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.004 ns                GC (median):    0.00%
 Time  (mean ± σ):   209.858 ns ±   5.289 ns   GC (mean ± σ):  0.00% ± 0.00%

       █▇                                                      
  ▂▂▂▃████▆▄▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▃▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  205 ns           Histogram: frequency by time          231 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 188 evaluations per sample.
 Range (min … max):  537.926 ns … 788.819 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     552.473 ns                GC (median):    0.00%
 Time  (mean ± σ):   557.286 ns ±  15.598 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▄▇█▃▁                                               
  ▂▁▁▂▂▂▃▄██████▅▃▃▃▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▁▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  538 ns           Histogram: frequency by time          612 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.336 μs … 21.195 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.419 μs               GC (median):    0.00%
 Time  (mean ± σ):   7.694 μs ±  1.404 μs   GC (mean ± σ):  0.00% ± 0.00%

  █         ▁▂                                              ▁
  █▇▅▇▃▃▁▁▃██▆▆▆▅▄▃▄▃▃▄▁▁▃▄▃▄▁▁▁▁▃▁▁▁▁▁▁▁▆▄▁▁▃▆▇▆▆▆▅▅▆▄▅▅▆ █
  7.34 μs      Histogram: log(frequency) by time     16.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
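
Note that reinterpret(reshape, Vec5{T}, u_aos_plain) does not copy any data; it just presents the 5×N matrix of scalars as a length-N vector of Vec5 values backed by the same memory. A tiny, purely illustrative sketch of this behavior:

using StaticArrays

A = [1.0 3.0; 2.0 4.0]                          # 2×2 matrix, stored column-major
v = reinterpret(reshape, SVector{2,Float64}, A) # length-2 vector of SVectors, no copy
v[1]            # SVector(1.0, 2.0), i.e., the first column of A
A[2, 1] = 20.0
v[1]            # SVector(1.0, 20.0); v sees the change because it shares A's memory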

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted version. There is no significant difference from the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 540 evaluations per sample.
 Range (min … max):  210.152 ns … 285.589 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     214.048 ns                GC (median):    0.00%
 Time  (mean ± σ):   216.073 ns ±   6.375 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▇█                                                      
  ▂▂▃▅████▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▂▂▂▂▂▂ ▃
  210 ns           Histogram: frequency by time          246 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 182 evaluations per sample.
 Range (min … max):  579.165 ns … 1.074 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     587.253 ns               GC (median):    0.00%
 Time  (mean ± σ):   594.689 ns ± 31.993 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▄▇█▅▂ ▁           ▂▄▄▃▁                                    ▂
  ████████▆▅▃▄▃▄▁▄▁▅███████▇▆▆▄▄▄▃▃▃▃▄▄▁▆▄▃▄▃▃▁▃▄▃▃▄▁▁▄▄▄▅▃▃ █
  579 ns        Histogram: log(frequency) by time       708 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations per sample.
 Range (min … max):  7.414 μs … 19.557 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.479 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.566 μs ± 521.487 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▄█   ▁                                               ▁    ▁
  ██▄▆█▆▁▁▁▁▁▄▅▆▆▄▅▃▁▃▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▃▁▁▃▁▁▁▁▃▃▇█████▆ █
  7.41 μs      Histogram: log(frequency) by time      9.43 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse and dense matrix representations perform particularly badly in this case, apparently falling back to generic, allocating code paths (note the allocation counts reported below).

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 496 evaluations per sample.
 Range (min … max):  219.808 ns … 320.681 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     222.292 ns                GC (median):    0.00%
 Time  (mean ± σ):   224.024 ns ±   5.260 ns   GC (mean ± σ):  0.00% ± 0.00%

      ▂▅█▇                                                 
  ▂▃▄▆█████▇▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▃▃▃▄▃▃▃▃▂▂▂▂ ▃
  220 ns           Histogram: frequency by time          239 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  214.271 μs … 6.929 ms   GC (min … max): 0.00% … 96.23%
 Time  (median):     224.229 μs                GC (median):    0.00%
 Time  (mean ± σ):   271.484 μs ± 413.777 μs   GC (mean ± σ):  9.85% ±  6.24%

    █▃                                                          
  ▂▇██▆▅▄▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▂▁▂▁▂▂▃▄▃▃▃▃▂▂▁▂ ▂
  214 μs           Histogram: frequency by time          376 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation per sample.
 Range (min … max):  174.617 μs … 6.726 ms   GC (min … max): 0.00% … 96.77%
 Time  (median):     182.451 μs                GC (median):     0.00%
 Time  (mean ± σ):   229.652 μs ± 416.134 μs   GC (mean ± σ):  11.72% ±  6.28%

   ▂▄                                                          
  ▅██▄▄▆▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▂▁▁▁▂▂▂▁▂▁▂▁▂▂▂▂▄▄▃▂▃▃▂▂ ▃
  175 μs           Histogram: frequency by time          326 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.10
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.74 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`