Benchmarks

Here are some simple benchmarks. Take them with a grain of salt: they are run on cloud virtual machines while the documentation is generated automatically, so the timings can fluctuate.
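
If you want to rerun these benchmarks on your own machine, a minimal sketch of a throwaway environment containing the packages used below could look as follows (the exact versions you get will differ from the ones reported at the end of each section):

using Pkg
Pkg.activate(temp = true)  # temporary environment, discarded when the session ends
Pkg.add(["BenchmarkTools", "SummationByPartsOperators", "DiffEqOperators",
         "BandedMatrices", "StaticArrays", "StructArrays"])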

First-derivative operators

Periodic domains

Let's set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, DiffEqOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = periodic_derivative_operator(derivative_order=1, accuracy_order=2,
                                     xmin=xmin, xmax=xmax, N=100)
x = grid(D_SBP)
D_DEO = CenteredDifference(derivative_order(D_SBP), accuracy_order(D_SBP),
                           step(x), length(x)) * PeriodicBC(eltype(D_SBP))

D_sparse = sparse(D_SBP)

u = randn(eltype(D_SBP), length(x)); du = similar(u);
@show D_SBP * u ≈ D_DEO * u ≈ D_sparse * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)
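
A brief note on this helper: the $ signs in @benchmark mul!($du, $D, $u) are the interpolation syntax of BenchmarkTools.jl. They splice the current values of the variables into the benchmarked expression, so the timing measures the mul! call itself instead of also including the overhead of looking up non-constant variables. An illustrative snippet, independent of the benchmarks in this section:

using BenchmarkTools
v = randn(1000)
@benchmark sum($v)  # interpolated: measures sum on the existing vector
@benchmark sum(v)   # not interpolated: also measures access to the non-constant global v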

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 995 evaluations.
 Range (min … max):  28.707 ns … 50.144 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     29.633 ns               GC (median):    0.00%
 Time  (mean ± σ):   29.933 ns ±  1.491 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▃▅▇█▂                                                  
  ▂▅██████▆▄▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▂▂▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂ ▃
  28.7 ns         Histogram: frequency by time        37.5 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare this to the runtime obtained using a sparse matrix representation of the derivative operator. Depending on the hardware and other factors, this can be roughly an order of magnitude slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 641 evaluations.
 Range (min … max):  189.949 ns … 303.874 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     196.857 ns                GC (median):    0.00%
 Time  (mean ± σ):   198.552 ns ±   6.194 ns   GC (mean ± σ):  0.00% ± 0.00%

           ▁▃▃▅█▇▆▆▄▂▁                                         
  ▁▁▁▂▂▃▄▆████████████▇▇▅▄▄▃▃▂▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁ ▃
  190 ns           Histogram: frequency by time          215 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we benchmark the implementation of the same derivative operator in DiffEqOperators.jl.

doit(D_DEO, "D_DEO:", du, u)
D_DEO:
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:412 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
┌ Warning: #= /home/runner/.julia/packages/DiffEqOperators/lHq9u/src/derivative_operators/convolutions.jl:460 =#:
│ `LoopVectorization.check_args` on your inputs failed; running fallback `@inbounds @fastmath` loop instead.
│ Use `warn_check_args=false`, e.g. `@turbo warn_check_args=false ...`, to disable this warning.
└ @ DiffEqOperators ~/.julia/packages/LoopVectorization/tIJUA/src/condense_loopset.jl:1166
BenchmarkTools.Trial: 10000 samples with 92 evaluations.
 Range (min … max):  799.207 ns … 138.891 μs   GC (min … max): 0.00% … 56.12%
 Time  (median):     822.402 ns                GC (median):    0.00%
 Time  (mean ± σ):   887.817 ns ±   2.121 μs   GC (mean ± σ):  4.48% ±  2.04%

   ▁▄▆▇██▆▅▄▃▃▃▂▂▂▂▂▁▁▃▃▄▃▂▁▁▁   ▁▂▃▃▂▂▂▁                     ▃
  ▅████████████████████████████████████████▇▇▇██▇▆▇▇██▇▇▇▇▇▆▆ █
  799 ns        Histogram: log(frequency) by time        996 ns <

 Memory estimate: 416 bytes, allocs estimate: 6.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "DiffEqOperators"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [9fdde737] DiffEqOperators v4.45.0
  [9f78cca6] SummationByPartsOperators v0.5.70 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Bounded domains

We start again by setting up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
D_sparse = sparse(D_SBP)
D_banded = BandedMatrix(D_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show D_SBP * u ≈ D_sparse * u ≈ D_banded * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, we benchmark the implementation from SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 200 evaluations.
 Range (min … max):  405.355 ns … 639.595 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     409.310 ns                GC (median):    0.00%
 Time  (mean ± σ):   413.050 ns ±  12.665 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▄█▇                                                        
  ▃▆████▆▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂ ▃
  405 ns           Histogram: frequency by time          457 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

Again, we compare this to a sparse matrix representation of the derivative operator. Unsurprisingly, it is again much slower, just as for periodic domains.

doit(D_sparse, "D_sparse:", du, u)
D_sparse:
BenchmarkTools.Trial: 10000 samples with 7 evaluations.
 Range (min … max):  4.650 μs … 8.344 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     4.696 μs                GC (median):    0.00%
 Time  (mean ± σ):   4.743 μs ± 233.698 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▃▇  ▁                                               ▁▁▁  ▂
  ███▇▆██▅▁▁▁▃▅▆▅▅▅▃▃▁▃▁▄▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▃▃▁▃▁▆▇▅▃▅▆█████ █
  4.65 μs      Histogram: log(frequency) by time      5.84 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, we compare it to a banded matrix representation. Disappointingly, this is still much slower than the optimized implementation from SummationByPartsOperators.jl.

doit(D_banded, "D_banded:", du, u)
D_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.640 μs … 12.776 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.660 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.743 μs ± 403.719 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  ███▄▃▁▁▃▄▆▄▄▁▄▃▁▃▄▁▃▃▁▁▄▄▁▄▁▁▁▁▄▁▃▁▁▄▁▁▃▄▇███▇▇▆▇▅▆▇▆▇▅▆ █
  6.64 μs      Histogram: log(frequency) by time      8.65 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
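
One contributing factor may be that the banded matrix-vector product always loops over the full bandwidths, and the wide boundary closure of the SBP operator forces comparatively wide bands, so many explicitly stored zeros are touched in the interior rows. If you want to check this for a given operator, BandedMatrices.jl provides bandwidths; a small sketch (the exact values depend on the accuracy order and the boundary closure):

# Number of stored sub- and superdiagonals of the converted operator
l, u = bandwidths(D_banded)
@show l u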

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.70 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Dissipation operators

We follow the same structure as before. First, we set up some benchmark code.

using BenchmarkTools
using LinearAlgebra, SparseArrays
using SummationByPartsOperators, BandedMatrices

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=6, xmin=xmin, xmax=xmax, N=10^3)
Di_SBP  = dissipation_operator(MattssonSvärdNordström2004(), D_SBP)
Di_sparse = sparse(Di_SBP)
Di_banded = BandedMatrix(Di_SBP)
Di_full   = Matrix(Di_SBP)

u = randn(eltype(D_SBP), size(D_SBP, 1)); du = similar(u);
@show Di_SBP * u ≈ Di_sparse * u ≈ Di_banded * u ≈ Di_full * u

function doit(D, text, du, u)
  println(text)
  sleep(0.1)
  show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D, $u))
  println()
end
doit (generic function with 1 method)

First, let us benchmark the derivative and dissipation operators implemented in SummationByPartsOperators.jl.

doit(D_SBP, "D_SBP:", du, u)
doit(Di_SBP, "Di_SBP:", du, u)
D_SBP:
BenchmarkTools.Trial: 10000 samples with 202 evaluations.
 Range (min … max):  387.802 ns … 672.540 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     392.262 ns                GC (median):    0.00%
 Time  (mean ± σ):   395.968 ns ±  12.501 ns   GC (mean ± σ):  0.00% ± 0.00%

   ▁▅▇█▇▆▅▃▁                                    ▂▃▃▃▂▁        ▂
  ▅█████████▇▆▅▆▇█▆▆▅▃▁▃▃▁▁▁▁▄▁▃▃▃▄▄▄▄▃▁▄▅▆▆▆▆█████████▇▆▇▆▆▆ █
  388 ns        Histogram: log(frequency) by time        438 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_SBP:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.016 μs … 2.970 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.027 μs               GC (median):    0.00%
 Time  (mean ± σ):   1.036 μs ± 82.576 ns   GC (mean ± σ):  0.00% ± 0.00%

  █  ▂▁▁▂▂▂▁▁▁▂▁▂▂▂▂▁▂▂▂▂▂▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂ ▂
  1.02 μs        Histogram: frequency by time        1.75 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we compare the results to the sparse and banded matrix representations. It will not come as a surprise that these are again much slower (roughly five to seven times in this run).

doit(Di_sparse, "Di_sparse:", du, u)
doit(Di_banded, "Di_banded:", du, u)
Di_sparse:
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.240 μs … 9.509 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     5.323 μs                GC (median):    0.00%
 Time  (mean ± σ):   5.372 μs ± 256.093 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█                                                  ▁▁  ▂
  ████▆██▆▄▃▁▄▄▅▆▆▅▁▁▃▁▁▁▁▁▃▁▁▁▁▁▃▁▁▃▁▁▃▁▁▃▁▃▆▇▃▁▁▃▄▅▇████ █
  5.24 μs      Histogram: log(frequency) by time      6.64 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
Di_banded:
BenchmarkTools.Trial: 10000 samples with 5 evaluations.
 Range (min … max):  6.173 μs … 14.791 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     6.823 μs                GC (median):    0.00%
 Time  (mean ± σ):   6.692 μs ± 684.690 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                          ▁                              ▁
  █▅█▇▄▅▅█▇▃▅▆▆▃▄▅▇█▇▆▆▅▆▅▆██▇▇▅▅▆▅▅▅▆▁▁▃▁▃▁▃▄▃▃▄▁▃▁▁▃▁▃▁▄ █
  6.17 μs      Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's benchmark the same computation if a full (dense) matrix is used to represent the dissipation operator. This is obviously a bad idea, but 🤷

doit(Di_full, "Di_full:", du, u)
Di_full:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  133.709 μs … 602.513 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     135.783 μs                GC (median):    0.00%
 Time  (mean ± σ):   138.902 μs ±  12.878 μs   GC (mean ± σ):  0.00% ± 0.00%

   ▆█▂▂▁    ▄▅▃▂▁▁                                            ▂
  █████████████████▇▆▇▅▆▇▇▇▇▇▇▆▆▅▄▆▅▅▅▅▄▄▄▄▃▅▃▁▄▅▅▄▄▄▅▅▅▃▁▄▄▅ █
  134 μs        Histogram: log(frequency) by time        182 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
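
Besides being slow, the dense representation also stores all N^2 entries even though the operator only couples a few neighboring grid points. A quick way to compare the memory footprints of the different representations (the exact numbers depend on N) is Base.summarysize:

# Total size in bytes of each representation of the dissipation operator
Base.summarysize(Di_full), Base.summarysize(Di_sparse), Base.summarysize(Di_banded)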

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "BandedMatrices"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [aae01518] BandedMatrices v0.17.18
  [9f78cca6] SummationByPartsOperators v0.5.70 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`

Structure-of-Arrays (SoA) and Array-of-Structures (AoS)

SummationByPartsOperators.jl tries to provide efficient support of statically sized vectors from StaticArrays.jl and of StructArrays.jl.

To demonstrate this, let us set up some benchmark code.

using BenchmarkTools
using StaticArrays, StructArrays
using LinearAlgebra, SparseArrays
using SummationByPartsOperators

BLAS.set_num_threads(1) # make sure that BLAS is serial to be fair

struct Vec5{T} <: FieldVector{5,T}
  x1::T
  x2::T
  x3::T
  x4::T
  x5::T
end

# Apply `mul!` to each component of a plain array of structures one after another
function mul_aos!(du, D, u, args...)
  for i in 1:size(du, 1)
    mul!(view(du, i, :), D, view(u, i, :), args...)
  end
end

T = Float64
xmin, xmax = T(0), T(1)

D_SBP = derivative_operator(MattssonNordström2004(), derivative_order=1,
                            accuracy_order=4, xmin=xmin, xmax=xmax, N=101)
D_sparse = sparse(D_SBP)
D_full   = Matrix(D_SBP)
101×101 Matrix{Float64}:
 -141.176    173.529   -23.5294   …    0.0         0.0       0.0
  -50.0        0.0      50.0           0.0         0.0       0.0
    9.30233  -68.6047    0.0           0.0         0.0       0.0
    3.06122    0.0     -60.2041        0.0         0.0       0.0
    0.0        0.0       8.33333       0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    ⋮                             ⋱                          ⋮
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0           0.0         0.0       0.0
    0.0        0.0       0.0      …    0.0         0.0       0.0
    0.0        0.0       0.0          -8.33333     0.0       0.0
    0.0        0.0       0.0          60.2041      0.0      -3.06122
    0.0        0.0       0.0           0.0        68.6047   -9.30233
    0.0        0.0       0.0         -50.0         0.0      50.0
    0.0        0.0       0.0      …   23.5294   -173.529   141.176

First, we benchmark the application of the operators implemented in SummationByPartsOperators.jl and their representations as sparse and dense matrices in the scalar case. As before, the sparse matrix representation is around an order of magnitude slower, and the dense matrix representation is slower still by a large margin.

println("Scalar case")
u = randn(T, size(D_SBP, 1)); du = similar(u)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_SBP, $u))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_sparse, $u))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du, $D_full, $u))
Scalar case
D_SBP
BenchmarkTools.Trial: 10000 samples with 989 evaluations.
 Range (min … max):  46.052 ns … 75.257 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     46.802 ns               GC (median):    0.00%
 Time  (mean ± σ):   47.269 ns ±  1.925 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▄█▆                                                    
  ▃██████▅▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▂▂▁▁▂▂▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▃▃▂▂▂▂ ▃
  46.1 ns         Histogram: frequency by time        55.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 232 evaluations.
 Range (min … max):  318.655 ns … 513.112 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     328.586 ns                GC (median):    0.00%
 Time  (mean ± σ):   332.282 ns ±  11.530 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▄▆██▅▁▁                                               
  ▁▁▁▁▂▄████████▇▅▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁ ▂
  319 ns           Histogram: frequency by time          370 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.711 μs … 4.043 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.724 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.745 μs ± 126.880 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                         ▂
  █▇▅▅█▆▄▁▁█▁▁▄▅▁▅▅▃▃▁▁▁▃▁▁▁▁▃▁▁▁▃▇▄▁▁▁▁▃▃▁▁▁▁▁▃▁▃▃▃▁▄▄▅▇█▇ █
  1.71 μs      Histogram: log(frequency) by time      2.52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we use a plain array of structures (AoS) in the form of a two-dimensional array together with our custom mul_aos! implementation, which loops over each component and applies mul! to views. Here, the differences between the timings are less pronounced.

println("Plain Array of Structures")
u_aos_plain = randn(T, 5, size(D_SBP, 1)); du_aos_plain = similar(u_aos_plain)
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_SBP, $u_aos_plain))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_sparse, $u_aos_plain))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul_aos!($du_aos_plain, $D_full, $u_aos_plain))
Plain Array of Structures
D_SBP
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.306 μs … 3.338 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.314 μs                GC (median):    0.00%
 Time  (mean ± σ):   1.328 μs ± 101.681 ns   GC (mean ± σ):  0.00% ± 0.00%

  █                                                          ▁
  █▃▄▅█▆▃▄▁▄▁▃▄▄▅▄▄▁▃▁▁▁▁▁▁▁▁▃▁▃▁▁▁▁▁▁▄▁▄▄▁▁▃▅▄▄▃▃▄▁▃▃▃▃▄▆▇ █
  1.31 μs      Histogram: log(frequency) by time      2.06 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
 Range (min … max):  2.354 μs … 6.020 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     2.410 μs                GC (median):    0.00%
 Time  (mean ± σ):   2.438 μs ± 174.225 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▁▅▇█▃                                                     ▂
  ██████▇▆▆▁▁▁▄▅▄▅▅▅▄▄▃▁▁▁▃▁▁▃▁▁▁▁▃▃▁▁▁▁▁▃▁▁▁▁▁▁▁▁▅▆▅▇████▇ █
  2.35 μs      Histogram: log(frequency) by time      3.27 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.980 μs … 21.467 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     9.040 μs                GC (median):    0.00%
 Time  (mean ± σ):   9.138 μs ± 540.497 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▇                                                         ▁
  █▅▄█▇▁▃▁▃▁▃▅▆▅▅▁▁▃▁▃▁▁▁▁▃▁▁▁▁▁▃▁▃▁▃▁▃▁▁▁▁▃▃▁▁▇▄▁▃▃▅▇████ █
  8.98 μs      Histogram: log(frequency) by time      11.8 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Now, we use an array of structures (AoS) based on reinterpret and standard mul!. This is much more efficient for the implementation in SummationByPartsOperators.jl. In Julia v1.6, this is also more efficient for sparse matrices but less efficient for dense matrices (compared to the plain AoS approach with mul_aos! above).
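
To illustrate what this reinterpretation does, here is a small standalone example that is independent of the benchmark data and uses an SVector instead of the Vec5 type defined above: a 5×N matrix of Float64 values can be viewed as a length-N vector of 5-component static vectors without copying.

using StaticArrays
A = reshape(collect(1.0:10.0), 5, 2)             # 5×2 Matrix{Float64}
v = reinterpret(reshape, SVector{5, Float64}, A) # length-2 vector of SVector{5, Float64}
v[1]        # SVector(1.0, 2.0, 3.0, 4.0, 5.0), the first column of A
A[1, 1] = 42.0
v[1][1]     # 42.0, since v shares its memory with A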

println("Array of Structures (reinterpreted array)")
u_aos_r = reinterpret(reshape, Vec5{T}, u_aos_plain); du_aos_r = similar(u_aos_r)
@show D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r
mul!(du_aos_r, D_SBP, u_aos_r)
@show reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_SBP, $u_aos_r))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_sparse, $u_aos_r))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos_r, $D_full, $u_aos_r))
Array of Structures (reinterpreted array)
D_SBP * u_aos_r ≈ D_sparse * u_aos_r ≈ D_full * u_aos_r = true
reinterpret(reshape, T, du_aos_r) ≈ du_aos_plain = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 570 evaluations.
 Range (min … max):  205.558 ns … 306.765 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     208.089 ns                GC (median):    0.00%
 Time  (mean ± σ):   209.877 ns ±   4.802 ns   GC (mean ± σ):  0.00% ± 0.00%

        ▅█                                                
  ▂▂▂▂▄▇███▇▅▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▁▂▁▂▁▂▂▁▂▁▂▂▂▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂ ▃
  206 ns           Histogram: frequency by time          224 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 187 evaluations.
 Range (min … max):  548.080 ns … 823.674 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     560.567 ns                GC (median):    0.00%
 Time  (mean ± σ):   565.598 ns ±  16.284 ns   GC (mean ± σ):  0.00% ± 0.00%

         ▂▅▇█▆                                              
  ▁▁▁▂▃▄████████▆▄▃▃▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▃▃▃▃▂▂▂▁▁▁▁▁▁▁▁ ▂
  548 ns           Histogram: frequency by time          613 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.301 μs … 15.289 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.354 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.434 μs ± 441.612 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▆                                                   ▁    ▁
  ██▅▁██▁▃▄▁▁▁▃▄▆▅▅▁▄▃▁▁▁▁▁▃▁▁▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▅▆████▇ █
  7.3 μs       Histogram: log(frequency) by time      9.38 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Next, we still use an array of structures (AoS), but copy the data into a plain Array instead of using the reinterpreted versions. There is no significant difference to the previous version in this case.

println("Array of Structures")
u_aos = Array(u_aos_r); du_aos = similar(u_aos)
@show D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos
mul!(du_aos, D_SBP, u_aos)
@show du_aos ≈ du_aos_r
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_SBP, $u_aos))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_sparse, $u_aos))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_aos, $D_full, $u_aos))
Array of Structures
D_SBP * u_aos ≈ D_sparse * u_aos ≈ D_full * u_aos = true
du_aos ≈ du_aos_r = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 565 evaluations.
 Range (min … max):  206.650 ns … 290.506 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     210.942 ns                GC (median):    0.00%
 Time  (mean ± σ):   213.077 ns ±   6.623 ns   GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▅▇█▆▄▁             ▂▃▄▄▂▁                               ▂
  ▅██████████▇▅▄▆▅▄▄▃▄▅▆█████████▇█▇▆▅▃▅▅▅▄▄▄▄▃▄▅▁▄▁▃▃▄▄▄▅▁▄▆ █
  207 ns        Histogram: log(frequency) by time        245 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 158 evaluations.
 Range (min … max):  654.506 ns … 902.380 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     701.241 ns                GC (median):    0.00%
 Time  (mean ± σ):   704.982 ns ±  20.929 ns   GC (mean ± σ):  0.00% ± 0.00%

                      ▁▂▄▅█▃▂                                
  ▁▁▁▁▁▁▁▁▁▂▂▂▃▃▄▄▅▆▆▇███████▇▅▄▃▂▂▂▁▁▁▁▁▁▂▂▂▂▂▂▃▃▃▃▃▂▂▂▂▁▁▁ ▃
  655 ns           Histogram: frequency by time          767 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_full
BenchmarkTools.Trial: 10000 samples with 4 evaluations.
 Range (min … max):  7.243 μs … 18.009 μs   GC (min … max): 0.00% … 0.00%
 Time  (median):     7.333 μs                GC (median):    0.00%
 Time  (mean ± σ):   7.431 μs ± 578.957 ns   GC (mean ± σ):  0.00% ± 0.00%

  ▅█▇▅                                               ▁▁▁    ▂
  ███████▆▄▁▃▄▅▆▆▅▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▃▁▁▁▁▃▃▁▁▃▁▁▁▁▄▆█████▇▅ █
  7.24 μs      Histogram: log(frequency) by time      9.44 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Finally, let's look at a structure of arrays (SoA). Interestingly, this is slower than the array of structures we used above. On Julia v1.6, the sparse matrix representation performs particularly badly in this case.
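
For context, a StructArray stores each field of its element type in a separate contiguous array while still behaving like an array of structs. A tiny standalone illustration with a hypothetical Point2 type (not the Vec5 used in the benchmarks):

using StructArrays

struct Point2{T}
  x::T
  y::T
end

v = [Point2(1.0, 2.0), Point2(3.0, 4.0)]
s = StructArray(v)  # behaves like a Vector{Point2{Float64}} ...
s.x                 # ... but stores the x components [1.0, 3.0] contiguously
s.y                 # and the y components [2.0, 4.0] in a separate array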

println("Structure of Arrays")
u_soa = StructArray(u_aos); du_soa = similar(u_soa)
@show D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa
mul!(du_soa, D_SBP, u_soa)
@show du_soa ≈ du_aos
println("D_SBP")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_SBP, $u_soa))
println("\nD_sparse")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_sparse, $u_soa))
println("\nD_full")
show(stdout, MIME"text/plain"(), @benchmark mul!($du_soa, $D_full, $u_soa))
Structure of Arrays
D_SBP * u_soa ≈ D_sparse * u_soa ≈ D_full * u_soa = true
du_soa ≈ du_aos = true
D_SBP
BenchmarkTools.Trial: 10000 samples with 446 evaluations.
 Range (min … max):  230.092 ns … 307.166 ns   GC (min … max): 0.00% … 0.00%
 Time  (median):     232.496 ns                GC (median):    0.00%
 Time  (mean ± σ):   234.547 ns ±   6.354 ns   GC (mean ± σ):  0.00% ± 0.00%

     ▄██                                                   
  ▂▃▇████▇▄▃▂▂▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂▂▁▁▁▂▂▂▂▂▂▂▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂ ▃
  230 ns           Histogram: frequency by time          255 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
D_sparse
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  213.808 μs … 7.925 ms   GC (min … max):  0.00% … 95.59%
 Time  (median):     223.175 μs                GC (median):     0.00%
 Time  (mean ± σ):   280.432 μs ± 454.783 μs   GC (mean ± σ):  10.35% ±  6.19%

  ▁▆█▅▆▃▂▁▁▁                              ▂▅▄▃▄▃▁              ▂
  ██████████▇█▇▇▇▇▇▆▆▆▆▅▅▆▆▆▅▄▄▄▄▄▄▃▄▄▃▁▁███████▇▇▇▆▅▅▁▃▃▅▄▄▅ █
  214 μs        Histogram: log(frequency) by time        412 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.
D_full
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  174.706 μs … 7.530 ms   GC (min … max):  0.00% … 96.22%
 Time  (median):     182.841 μs                GC (median):     0.00%
 Time  (mean ± σ):   236.956 μs ± 451.834 μs   GC (mean ± σ):  12.19% ±  6.22%

    █                                                           
  ▃▇█▃▅▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▁▂▁▂▂▂▂▂▂▃▄▃▃▃▃▃▂▂ ▂
  175 μs           Histogram: frequency by time          328 μs <

 Memory estimate: 328.25 KiB, allocs estimate: 10504.

These results were obtained using the following versions.

using InteractiveUtils
versioninfo()

using Pkg
Pkg.status(["SummationByPartsOperators", "StaticArrays", "StructArrays"],
           mode=PKGMODE_MANIFEST)
Julia Version 1.6.7
Commit 3b76b25b64 (2022-07-19 15:11 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, generic)
Environment:
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager
      Status `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl/docs/Manifest.toml`
  [90137ffa] StaticArrays v1.9.7
  [09ab397b] StructArrays v0.6.18
  [9f78cca6] SummationByPartsOperators v0.5.70 `~/work/SummationByPartsOperators.jl/SummationByPartsOperators.jl`